Fault tolerance

This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.

Chapter 8, the self-proclaimed pessimistic chapter of the book, covers fault tolerance in distributed systems. Let us briefly look at why things break in an internet-scale system:

  • Commodity machines break. Since internet applications are about scaling out instead of up, each node in the cluster is cheap. Cheap things break more frequently.
  • The Internet Protocol is unreliable. There is no guarantee that things will move as intended. Everything is sent over the wire in packets, whereas traditional telephony networks set up dedicated circuits. This lack of a reliable, consistent connection presents problems: the request could be lost or queued, the node accepting the request could stop responding or fail altogether, or, worse, your request could have made it all the way to the node only for the response to be lost, or to arrive later than you want it to, on the way back to you.
    • We layer TCP on top to make things more reliable. The downside is variable network delays, since the queuing and retransmission that provide reliability also slow the system down.
    • You can trade TCP for UDP if you need more speed. If you cannot afford delays but can tolerate some data loss, consider UDP. A Zoom call is a good example of when UDP is a better fit than TCP.
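As a sketch of the tradeoff, here is a minimal UDP exchange in Python (a local illustration, not from the book): a datagram is sent with no handshake and no acknowledgment, so across a real network it could be dropped, duplicated, or reordered with no retry.

```python
import socket

# UDP is fire-and-forget: no connection setup, no delivery guarantee.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"video frame 42", ("127.0.0.1", port))  # no handshake, no ack

# Delivered on loopback, but a real network makes no such promise.
data, addr = receiver.recvfrom(1024)
print(data)

sender.close()
receiver.close()
```

On loopback this works every time; the point is that nothing in the protocol itself would tell the sender if it had not.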

This is in contrast to vertically scaled systems that rely on beefy supercomputers. Those systems often behave like single-computer applications, so neither the book nor these notes cover them.

The crucial thing to internalize is that faults happen in the real world all of the time. The best anecdote from the book is how a shark bit some network cables and took down parts of Amazon EC2. Even a company as powerful as Amazon cannot prevent the almighty shark from screwing up network connectivity.

How to deal with faults

If we know faults will happen, then it is vital to be proactive in dealing with them.

Error handling is a simple and effective way of staying on top of faults. Surfacing the issue to customers may be enough; sometimes a browser refresh is all it takes to correct the problem.

Heartbeats help detect faulty or downed nodes. A load balancer (like HAProxy or Nginx) periodically pings nodes to ensure they are up before sending them traffic.
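A crude version of such a health check can be sketched in Python (the host, port, and timeout here are illustrative; real load balancers use richer checks than a bare TCP connect):

```python
import socket

def is_alive(host: str, port: int, timeout: float = 1.0) -> bool:
    """Crude heartbeat: a node counts as up if its port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A load balancer would run this periodically and route traffic
# only to the nodes that answered recently.
nodes = [("127.0.0.1", 80)]  # hypothetical backend list
healthy = [n for n in nodes if is_alive(*n)]
```

The timeout on the connection attempt matters: without it, a hung node would stall the health checker itself.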

Leader election is a strategy for promoting a new leader (write replica) when the current leader goes down in a replicated cluster.

Timeouts are the gate for all of the above strategies. For customers, the only reliable way to detect a fault is to inform them that too much time has elapsed while processing their request. In that window, you will have tried the strategies above; once the timeout is exceeded, it is safe to assume a fault has occurred.

How do you choose a sensible timeout? Measure your response times and their variability, then choose a number. Unfortunately, there is no standard timeout interval that works for everyone. Apache Cassandra uses the Phi Accrual failure detector to adaptively tune how its gossip protocol decides whether nodes are down. TCP does its own retransmission, but there is not much there for you to reuse in your own applications.
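One simple heuristic (my own illustration, not a prescription from the book) is to derive the timeout from measured latencies, setting it well above the observed tail:

```python
import statistics

# Hypothetical latency samples (seconds) measured from production traffic.
samples = [0.12, 0.15, 0.11, 0.30, 0.14, 0.90, 0.13, 0.16, 0.12, 0.18]

# Heuristic: mean plus a few standard deviations clears the slow tail
# without waiting forever. The multiplier is a tunable assumption.
mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
timeout = mean + 3 * stdev
print(f"timeout = {timeout:.2f}s")
```

In practice you would recompute this from a rolling window, since latency distributions drift, which is essentially what the Phi Accrual approach automates.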


Unreliable clocks

Timezones are hard. Dealing with time in general is difficult. All computers have clocks, but they are rarely in sync with each other. How do you ensure that the time or duration of an activity is accurate?

Sync them. NTP servers help ensure that the clock on your computer is actually correct. If it is not, the NTP server sends back the correct time and your clock is updated. The problem is that your clock could hop backward: if you sent a request at 10:00:01 and the response appears to arrive at 10:00:00, you have a problem. That is why these clocks, called time-of-day clocks, are unreliable for measuring durations.

Count monotonically. As a safeguard against the standard clock, computers also provide a monotonic clock. It is essentially a giant counter that only ever counts upward, regardless of time-syncing issues. NTP can still adjust a monotonic clock, but only by nudging the rate at which it advances, never by jumping it backward. This is generally the preferred clock for measuring durations in distributed systems.
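Most languages expose both clock types. In Python, for instance, the difference is the choice between `time.time()` and `time.monotonic()` (a sketch of my own, not from the book):

```python
import time

# Wall-clock (time-of-day) timestamps can jump backward after an NTP sync,
# so a duration measured with time.time() could come out negative.
# time.monotonic() only ever moves forward, making it safe for durations.
start = time.monotonic()
time.sleep(0.05)                     # simulate some work
elapsed = time.monotonic() - start
assert elapsed >= 0                  # guaranteed; time.time() makes no such promise
print(f"elapsed: {elapsed:.3f}s")
```

The monotonic clock's absolute value is meaningless (it may count from boot); only differences between two readings are useful.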

Even with both mitigation strategies in place, no system is perfect. There is a network delay from the NTP server to your local clock, so they will never be fully in sync. Firewalls, leap seconds, virtual machines, and physical hardware clock malfunctions all contribute to perpetual inaccuracy between the actual time and the time reported on your machine. As long as you accept this premise, you can achieve reasonably high accuracy with the timestamps you use.

Strategies around timing issues

With all of these issues around keeping track of time, there are really only three strategies worth implementing:

  • The last write wins. Cassandra and Riak use this strategy to determine the most recent write to a database. But it suffers from the basic problem: what counts as "recent"? Since it is all relative anyway, two nodes could write to the DB with the same timestamp, or a node whose clock is behind could have its genuinely newer write silently discarded.
  • Provide times with a delta. Google Spanner uses this with its TrueTime API, but good luck using it outside of Google.
  • Limit functions that would cause time delays and pauses. Garbage collection is a notoriously slow process that can hang a system. Ensuring there is enough memory headroom so that garbage collection never grows unwieldy buys you some fault tolerance.
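A toy sketch of last-write-wins (the names and timestamps are invented for illustration) shows how the first strategy loses data when one node's clock is skewed:

```python
from dataclasses import dataclass

@dataclass
class Write:
    value: str
    timestamp: float  # from the writing node's own (possibly skewed) clock

def last_write_wins(a: Write, b: Write) -> Write:
    """Keep whichever write carries the later timestamp; ties are arbitrary."""
    return a if a.timestamp >= b.timestamp else b

# Node 2's clock runs a few seconds slow, so its genuinely newer write loses:
older = Write("v1", timestamp=100.0)   # written first, accurate clock
newer = Write("v2", timestamp=97.0)    # written later, but clock is behind
assert last_write_wins(older, newer).value == "v1"  # silent data loss
```

Nothing in the merge function can detect this: the timestamps look perfectly ordinary, which is exactly why LWW is only safe when losing concurrent writes is acceptable.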

The message you must remember is that time is relative, and you cannot achieve real-time. Real-time, in the truest sense of the word, is reserved for embedded systems, so things like your airbag can deploy when they need to. Real-time, as it is used at web scale, is a relative term meaning "feels instant," even if the operation is not actually instantaneous.

Trusting a faulty system

Now that we know systems are unreliable, we need to know whether the actors can be trusted. The book assumes they can be and that each player in the distributed system acts with honest intentions. This is where the old security saying of trust but verify comes into play.

Quorums, as we discussed in the data replication chapter, are a way of collecting votes from nodes to figure out what the majority believes about the system when there is a discrepancy.

If you have 3 children and ask, “who stole the cookies from the cookie jar,” you’re likely to believe it is child 2 if both 1 and 3 point to 2.

What if one of them is lying? This is the basis for the Byzantine Generals' Problem. Systems that safeguard against lying actors are Byzantine fault-tolerant, which usually means requiring a two-thirds supermajority rather than a simple majority vote. They are only needed for mission-critical systems like aerospace or the blockchain. Internet web-scale systems do not require Byzantine fault tolerance, and we can improve ordinary fault tolerance with a few simple tricks:

  1. Require checksums. TCP and UDP both include checksums that detect corrupted packets; TCP additionally retransmits them so data is transferred correctly.
  2. Permission-check, sanitize, validate, error-handle, and output-escape your data. This blog post beautifully summarizes a recipe for eliminating bugs from your code when ingesting data. Bugs are a source of "lying," since faulty code makes the system behave unexpectedly.
  3. Add redundancy to your NTP servers. Querying multiple time-synchronization servers and taking the majority reading ensures a single misbehaving time source cannot mislead you.
  4. Build systems around crash recovery and partially synchronous system models. System models take many shapes. You will not see the strictness of a completely synchronous, Byzantine fault-tolerant system in the real world, but you also do not want a fully asynchronous model where the slightest perturbation shuts the system down. Striking a middle ground with your system model is a sensible, useful real-world approach.
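The checksum idea from point 1 can be sketched at the application level with CRC32 (an illustration of the principle; TCP and UDP compute their checksums internally, and the framing format here is invented):

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC32 checksum so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unframe(data: bytes) -> bytes:
    """Verify the trailing checksum; raise if the payload was corrupted."""
    payload, checksum = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != checksum:
        raise ValueError("corrupted frame -- ask the sender to retransmit")
    return payload

msg = frame(b"hello")
assert unframe(msg) == b"hello"

corrupted = b"jello" + msg[5:]   # flip the payload, keep the old checksum
try:
    unframe(corrupted)
except ValueError:
    pass  # corruption detected; a real protocol would trigger a retry here
```

Note that a checksum only detects accidental corruption; it does nothing against a deliberately lying node, which is why Byzantine settings need cryptographic signatures instead.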

The next chapter will investigate algorithmic approaches to handling real-world fault tolerance with system models that handle crash recovery and strive for partially synchronous updates.

This quote at the end of the chapter captures how I would frame the litany of strategies in this book for a more junior engineer:

If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so.

It is easy to look at a book like this and think of all the neat technologies you could use on your current project. Resisting that urge and finding ways to say no is arguably the better approach. Most companies and most systems will never require scale beyond a simple Postgres server. Aside from safeguards around data replication, a simple RDBMS can scale to millions and millions of records without much else to power it.

Until we meet again in the next chapter, here is some additional reading material I enjoyed with this week’s chapter:
