This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.

This chapter, and the last one we will cover in this book series, covers consistency and consensus. It focuses on the CAP theorem. We will explore all levels of consistency guarantees.

Consistency guarantee levels

There are three grades of consistency guarantees:

  • Weak: Reads may or may not see writes. Often seen in real-time (VOIP, chat) or in-memory solutions like Memcached.
  • Eventual: Reads will eventually converge to see the latest write. Data replication is asynchronous. BASE NoSQL databases use this model most often. Highly available systems like email use eventual consistency. When convergence occurs is arbitrary and one of the more tenuous aspects of this consistency guarantee.
  • Strong: Reads always see writes instantly. Data replication is synchronous. ACID relational databases use this model.
    • Causality: Something happens after another. Questions precede answers. Causal ordering is often a comparison of a limited set of operations rather than the total ordering of all operations in a system. This is the practical limit of strict consistency in distributed systems. Timestamps in a replication log are an easy way to achieve a causally consistent dataset.
    • Linearizability: The strictest definition of consistency as it relates to the CAP theorem. Applications do not worry about replicas and maintain the illusion of a copy of the data. It’s a recency guarantee that states there is an atomic operation that happens in the database where the data can never revert to the old value after a particular point in time. It is said to evolve data linearly. Apache ZooKeeper uses linearizability for leader election to ensure you cannot promote two nodes to be the same leader in single leader replication schemes.

One final note on linearizability: single leader replication is your only practical strategy for achieving a highly available system with strong guarantees. Other replication strategies, such as multi-leader and leaderless replication, utilize asynchronous replication. Because strong consistency requires synchronous replication, we can only rely on single leader replication where the leader handles both reads and writes, and the replicas simply act as fault-tolerant backups.

This limits our use cases for replicated, linearizable systems. Since multi-leader and leaderless replication are out, you cannot distribute your data across multiple data centers. This is because each data center needs to have its own leader.

In theory, you could have multiple data centers that all have to access a central data center where your single leader lives, but if any data centers fail you risk halting all operations. This could crush performance and grind requests to a halt.

The book covers quite a bit more on ordering and linearizability than I would have liked because Kleppmann admits that linearizable consistency is not practical in real-world distributed systems. And yet, there is quite a bit he covers on things like total order broadcast for such systems that we will likely never need to use.

On consensus

To have a consistent view of the data across distributed systems, all of those nodes have to agree on the state of the data. This is known as consensus. One classical way of achieving consensus is through something called a two-phase commit. It’s basically like marriage vows: a coordinator (officiant) asks each participant if they are willing to commit to the transaction (marriage). That is phase one, the preparatory phase. If every participant says yes (“I do”), then the commit phase executes and the transaction has to complete.

These kinds of algorithms are often implemented in relational databases for distributed systems. You don’t see this covered much in NoSQL solutions. And while there are benefits to this system to increase atomicity and durability, there is a great risk for performance. If the coordinator goes down and the log record of the 2PC is lost, you could potentially lock the database forever. As we’ve seen before, we never want a single point of failure. the two-phase commit algorithm rests heavily on the health of the coordinator application to ensure a smooth transaction across a distributed system. How might we mitigate this?

Better consensus algorithms have emerged over the years. The two most commonly-referenced algorithms, both in this book and externally, are Paxos and Raft. They are total order broadcast algorithms so they deliver messages exactly once, in order, to all nodes. In other words, this is a repeated barrage of consensus decisions for each node to decide on what the current state of the data is.

A common use case for these algorithms is with leader election. You need to gather a majority vote amongst nodes as to who is the new leader when the old leader dies or goes offline. While this helps ensure that two nodes can’t claim to be the leader at once, they do slow down the application significantly. Consensus protocols require synchronous replication of votes which is the slowest way to provide consistency of decisions.

Further, they require a majority voting structure. This means you can’t have an even number of nodes since you could have a tie. This also means you can’t dynamically add or remove nodes into your cluster. This is because the number of nodes you have at any given moment will have to flip from odd to even. In the end, you have a very rigid and brittle system for consensus that does not support the dynamic nature of modern distributed systems.

ZooKeeper to the rescue

Can you achieve a fault-tolerant total order broadcast with high performance? Apache ZooKeeper aims to do just that. Google Chubby, Etcd, and even Redis can act as a distributed lock manager.

All of these systems act as in-memory databases that hold a small amount of information about all nodes. Think of them as an index card of emergency numbers that are distributed to every node. These coordinator services ensure that a tiny amount of information is embedded on every node so that if a decision needs to be made quickly, consensus can be reached without a lot of heavy back-and-forths over the wire.

In fact, tools like ZooKeeper will purposefully make small decisions across a tiny number of nodes to ensure the voting is fast. If a leader dies, for example, it will specifically only target three to five other nodes for a vote, rather than rely on quorum from the entire cluster.

The other benefit of a tool like ZooKeeper is it provides a method for service discovery in partitioned systems as well. It acts sort of like a load balancer in that it can route requests to the correct node or cluster to reach a particular service. In doing so, it can also act as a heartbeat service to check for the membership status of the nodes in its jurisdiction.

If there’s anything I’ve learned over the last several weeks, ZooKeeper is the swiss army knife of distributed system coordination. If you’re looking for a tool that handles consensus, failure detection, membership, and fault tolerance, this should be the first one for you to grab in your toolchain.

ZooKeeper is not a silver bullet. Multi-leader and leaderless replication don’t require global consensus, and thus don’t need ZooKeeper. As we discussed in a previous chapter, these replication schemes are the most common in distributed systems. These are also the most common schemes to use NoSQL databases. Thus, you won’t see ZooKeeper being used outside of distributed RDBMSs.

Closing thoughts

This wraps up my series on distributed systems following this book. There are quite a few more chapters that are not covered here. I’d encourage you to check out more resources if you’re interested:

Get the FREE UI crash course

Sign up for our newsletter and receive a free UI crash course to help you build beautiful applications without needing a design background. Just enter your email below and you'll get a download link instantly.

A new version of this app is available. Click here to update.