This chapter, the last one we will cover in this book series, is about consistency and consensus. It centers on the CAP theorem, and we will explore the levels of consistency guarantees available to distributed systems.
Consistency guarantee levels
There are three broad grades of consistency guarantees, along with two stricter ordering models worth calling out:
- Weak: Reads may or may not see writes. Often seen in real-time systems (VoIP, chat) or in-memory solutions like Memcached.
- Eventual: Reads will eventually converge on the latest write. Data replication is asynchronous. BASE NoSQL databases use this model most often, and highly available systems like email rely on it. Exactly when convergence occurs is not guaranteed, which is one of the more tenuous aspects of this guarantee.
- Strong: Reads always see the latest write. Data replication is synchronous. ACID relational databases use this model.
- Causality: One operation happens after another; questions precede answers. Causal ordering typically compares a limited set of related operations rather than imposing a total order on every operation in the system. This is the practical limit of strict consistency in distributed systems. Sequence numbers or logical timestamps in a replication log are an easy way to achieve a causally consistent dataset.
- Linearizability: The strictest definition of consistency as it relates to the CAP theorem. Applications do not have to reason about replicas; they see the illusion of a single copy of the data. It’s a recency guarantee: once a write completes, every subsequent read sees the new value, and the data can never appear to revert to the old value. The data is said to evolve linearly. Apache ZooKeeper uses linearizable operations for leader election to ensure that two nodes cannot both be promoted to leader in single leader replication schemes.
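To make the gap between eventual and strong consistency concrete, here is a minimal Python sketch (the class and method names are my own, not from the book) of a leader that replicates either asynchronously, where follower reads can be stale, or synchronously, where they cannot:

```python
class Node:
    """A single replica holding one value."""
    def __init__(self):
        self.value = None

class Leader(Node):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.lagging = []  # writes not yet applied on followers

    def write_async(self, value):
        # Eventual consistency: the leader acknowledges the write
        # before followers have applied it.
        self.value = value
        self.lagging.append(value)

    def write_sync(self, value):
        # Strong consistency: every follower applies the write
        # before it is acknowledged.
        self.value = value
        for f in self.followers:
            f.value = value

    def converge(self):
        # Asynchronous replication eventually catches up.
        for value in self.lagging:
            for f in self.followers:
                f.value = value
        self.lagging.clear()

follower = Node()
leader = Leader([follower])
leader.write_async("v1")
stale = follower.value   # None — the write has not replicated yet
leader.converge()
fresh = follower.value   # "v1" — the replicas have converged
```

The stale read between `write_async` and `converge` is exactly the window that eventual consistency leaves open and strong consistency closes.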
One final note on linearizability: single leader replication is your only practical strategy for achieving a highly available system with strong guarantees. Other replication strategies, such as multi-leader and leaderless replication, utilize asynchronous replication. Because strong consistency requires synchronous replication, we can only rely on single leader replication where the leader handles both reads and writes, and the replicas simply act as fault-tolerant backups.
This limits our use cases for replicated, linearizable systems. With multi-leader and leaderless replication out, you cannot easily distribute writes across multiple data centers, because each data center would need its own leader.

In theory, you could have multiple data centers that all route through a central data center where your single leader lives, but if that central data center fails you risk halting all operations everywhere. This could crush performance and grind requests to a halt.
The book covers quite a bit more on ordering and linearizability than I would have liked because Kleppmann admits that linearizable consistency is not practical in real-world distributed systems. And yet, there is quite a bit he covers on things like total order broadcast for such systems that we will likely never need to use.
To have a consistent view of the data across distributed systems, all of the nodes have to agree on the state of the data. This is known as consensus. One classical way of achieving consensus is through something called a two-phase commit. It’s basically like marriage vows: a coordinator (the officiant) asks each participant whether it is willing to commit to the transaction (the marriage). That is phase one, the prepare phase. If every participant says yes (“I do”), then phase two, the commit phase, executes and every participant must complete the transaction.
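The two phases can be sketched in a few lines of Python. This is a toy with my own simplified names, not a production protocol; real implementations also log decisions durably so they survive a coordinator crash:

```python
class Participant:
    """A database node taking part in a distributed transaction."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self, txn):
        # Phase 1: vote yes only if we can guarantee a later commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1 (prepare): every participant must vote yes.
    if all(p.prepare(txn) for p in participants):
        # Phase 2 (commit): the decision is now irrevocable.
        for p in participants:
            p.commit(txn)
        return "committed"
    # Any "no" vote aborts the transaction everywhere.
    for p in participants:
        p.abort(txn)
    return "aborted"
```

Notice the danger the sketch glosses over: a participant stuck in the `"prepared"` state must hold its locks until the coordinator tells it the outcome.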
These kinds of algorithms are often implemented in relational databases for distributed systems; you don’t see them covered much in NoSQL solutions. And while two-phase commit improves atomicity and durability, it carries a real risk to availability: if the coordinator goes down and its log record of the 2PC outcome is lost, participants can be left holding their locks indefinitely, potentially locking the database forever. As we’ve seen before, we never want a single point of failure, and the two-phase commit algorithm rests heavily on the health of the coordinator to ensure a smooth transaction across a distributed system. How might we mitigate this?
Better consensus algorithms have emerged over the years. The two most commonly referenced, both in this book and externally, are Paxos and Raft. They are total order broadcast algorithms: they deliver messages exactly once, in the same order, to all nodes. In other words, they run a repeated series of consensus decisions through which every node agrees on the current state of the data.
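One simple way to picture total order broadcast is a single sequencer that stamps every message with a global sequence number and delivers in that order. To be clear, this is my own illustration, not how Paxos or Raft work; the sequencer is a single point of failure, and the whole point of those algorithms is to agree on the same total order without one:

```python
class Node:
    """Applies messages strictly in sequence-number order."""
    def __init__(self):
        self.log = []

    def deliver(self, seq, msg):
        # Total order broadcast: no gaps, no reordering, no duplicates.
        assert seq == len(self.log), "delivery must be in order"
        self.log.append(msg)

class Sequencer:
    """Toy total order broadcast via one process that assigns
    a global order to every message."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.next_seq = 0

    def broadcast(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        for node in self.nodes:
            node.deliver(seq, msg)
```

After any sequence of broadcasts, every node holds an identical log, which is exactly the "same messages, same order" guarantee the paragraph above describes.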
A common use case for these algorithms is leader election. You need to gather a majority vote among the nodes as to who the new leader is when the old leader dies or goes offline. While this helps ensure that two nodes can’t claim to be the leader at once, consensus protocols do slow down the application significantly: they require synchronous replication of votes, which is the slowest way to provide consistent decisions.

Further, they require a majority voting structure. In practice you want an odd number of nodes, since an even number invites split votes while adding no extra fault tolerance. Dynamically adding or removing nodes is also difficult, because changing the cluster size changes what counts as a majority and requires a careful reconfiguration protocol. In the end, you have a fairly rigid system for consensus that does not easily support the dynamic nature of modern distributed systems.
ZooKeeper to the rescue
Coordination services like ZooKeeper act as in-memory databases that hold a small amount of information about all nodes. Think of them as an index card of emergency numbers distributed to every node. They ensure that a tiny amount of metadata is available to every node so that when a decision needs to be made quickly, consensus can be reached without a lot of heavy back-and-forth over the wire.
In fact, tools like ZooKeeper purposefully run their votes across a small ensemble of nodes to keep voting fast. If a leader dies, for example, only the three to five nodes in the ensemble vote, rather than requiring a quorum from the entire application cluster.
The other benefit of a tool like ZooKeeper is that it also provides service discovery in partitioned systems. It acts somewhat like a load balancer in that it can route requests to the correct node or cluster to reach a particular service. In doing so, it can also act as a heartbeat service, checking the membership status of the nodes in its jurisdiction.
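ZooKeeper's usual leader-election recipe ties these ideas together: each client creates an ephemeral sequential znode, the client holding the lowest sequence number is the leader, and when a client's session dies its znode disappears, so the next-lowest takes over automatically. Here is a toy in-memory stand-in for that recipe; this is not the real ZooKeeper API, just an illustration of the mechanism:

```python
class ToyZk:
    """In-memory stand-in for ZooKeeper's ephemeral sequential znodes."""
    def __init__(self):
        self.next_seq = 0
        self.znodes = {}  # client id -> sequence number

    def create_ephemeral_sequential(self, client):
        # Each registration gets a monotonically increasing number.
        self.next_seq += 1
        self.znodes[client] = self.next_seq
        return self.next_seq

    def leader(self):
        # The election recipe: lowest surviving sequence number wins.
        return min(self.znodes, key=self.znodes.get)

    def session_expired(self, client):
        # Ephemeral znodes vanish with their owner's session,
        # which doubles as failure detection.
        del self.znodes[client]

zk = ToyZk()
for client in ("a", "b", "c"):
    zk.create_ephemeral_sequential(client)
zk.leader()            # "a" — the first client to register
zk.session_expired("a")
zk.leader()            # "b" — automatic failover, no fresh global vote
```

Because failover is just "delete a znode, re-read the minimum," no election message storm is needed when a leader dies.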
If there’s anything I’ve learned over the last several weeks, it’s that ZooKeeper is the Swiss Army knife of distributed system coordination. If you’re looking for a tool that handles consensus, failure detection, membership, and fault tolerance, it should be the first one you grab from your toolchain.
ZooKeeper is not a silver bullet. Multi-leader and leaderless replication don’t require global consensus, and thus don’t need ZooKeeper. As we discussed in a previous chapter, these replication schemes are the most common in distributed systems, and they are also the schemes most commonly paired with NoSQL databases. Thus, you won’t often see ZooKeeper used in systems that don’t need this kind of global coordination.
This wraps up my series on distributed systems following this book. There are quite a few more chapters that are not covered here. I’d encourage you to check out more resources if you’re interested:
- Grokking the system design interview is a way to practice the things learned in this series
- Grokking the advanced system design interview offers even more practice and more closely resembles the first principles tools used throughout this series
- This system design primer includes flashcards to provide spaced repetition learning of the concepts we’ve discussed
- This channel and this series offer an excellent selection of videos on distributed systems design
- High Scalability is the gold standard for cutting-edge news and research in practical distributed systems. Consider subscribing!