As with my series on algorithms, I’ve decided that I need to really understand systems design.
I run hiring for multiple teams. Some of our questions revolve around systems design. How can I possibly ask these questions if I couldn’t answer them perfectly myself?
It’s only fair that I master these concepts. The book that everyone seems to agree is the best on this topic is Designing Data-Intensive Applications by Martin Kleppmann.
In the spirit of my last educational series, I will transcribe my raw notes over the next 7 weeks or so. I do this both to teach my readers and also to reinforce the concepts myself. As I recall from my course on learning how to learn, one of the best ways to learn is to teach others.
I’ve decided to craft a syllabus for myself, as if I were both the teacher and the student, to keep me on track and reinforce the concepts. If you have a similar learning plan or have questions about mine, please feel free to reach out.
- Week 1 (this week)
  - Read Chapter 1: Scalability, Reliability, and Maintainability
  - Practice: Design a scalable system to count events
  - Explore, at a high-level, data tools in the space
- Week 2
  - Read Chapter 2: Data Models & Query Languages
  - Practice: Top K Problem
- Week 3
  - Read Chapter 3: Storage & Retrieval
  - Practice: Distributed cache
- Week 4
  - Read Chapter 5: Replication
  - Practice: Rate limiter
- Week 5
  - Read Chapter 6: Partitioning
  - Practice: Notification service
- Week 6
- Week 7
- Week 8
Notes on Chapter 1
Chapter one is really an overview chapter. It’s easy to read and easy to skim. There aren’t too many deep concepts here. I found that a handful of short lists and rules of thumb are all you need to cover the basics.
When thinking of data storage, consider these 5 buckets:
- DB engines
  - Row: good for OLTP, random reads, random writes, and write-heavy interactions (Postgres, MySQL)
  - Column: good for OLAP, heavy reads on few columns, and few writes (Vertica). HBase, Cassandra, and Bigtable are often grouped here too, but they are wide-column stores that keep each row's data together and are tuned for heavy writes rather than analytics
  - Document: good for data models with one-to-many relationships such as a CMS like a blog or video platform, or when you’re cataloging things like e-commerce products (MongoDB, CouchDB)
  - Graph: good for data models with many-to-many relationships such as fraud detection between financial and purchase transactions or recommendation engines that associate many relationships to recommend products (Neo4j, Amazon Neptune)
  - Key-Value: essentially a giant hash table/dictionary, good for in-memory session storage or caching e-commerce shopping cart data (Redis, DynamoDB)
- Search indexes (Elasticsearch, Solr)
- Caches (Memcached, Redis)
- Batch processing frameworks: good if data delay can be several hours or more and you can store all the data and process it later (Hadoop, Flink in batch mode)
- Stream processing frameworks: good if data delay can only be several minutes and you need to store data in aggregate while processing data on-the-fly (Kafka, Samza, Storm, Flink)
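To make the row vs. column distinction above concrete, here is a minimal sketch in plain Python. The event fields (`user_id`, `page`, `ms`) and the helper names are made up for illustration; real engines do this on disk with compression and indexes, which this toy ignores.

```python
# Toy data: three page-view events (field names are hypothetical).
rows = [
    {"user_id": 1, "page": "/home", "ms": 120},
    {"user_id": 2, "page": "/cart", "ms": 340},
    {"user_id": 1, "page": "/cart", "ms": 200},
]


def get_record(row_store, index):
    """Row layout: one whole record lives together (OLTP-style access)."""
    return row_store[index]


def to_columns(row_store):
    """Column layout: all values of one field live together (OLAP-style access)."""
    columns = {}
    for row in row_store:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field only touches that field's list.
average_ms = sum(columns["ms"]) / len(columns["ms"])

print(get_record(rows, 1))  # the full second record
print(average_ms)           # 220.0
```

Fetching a whole record is a single lookup in the row layout, while the average only reads the `ms` list in the column layout. That is roughly the trade-off behind the OLTP/OLAP split.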
Systems design in a nutshell
Approaching systems design breaks down into 5 steps:
- Functional requirements are designed to get us to think in APIs where we translate sentences into verbs for function names and nouns for input and return values.
- Non-functional requirements describe system qualities like scalability, availability, performance, consistency, durability, maintainability, and cost.
- High-level designs present the inbound and outbound data flow of the system.
- Detailed designs are for the specific components you want to focus on. Focus on the technologies you want to use for the data you want to store, transfer, process, and access.
- Bottlenecks & tradeoffs ensure we know how to find the limits of our designs and how we can balance solutions since there is no one singular correct answer to an architecture.
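As a rough illustration of the first step, here is how two requirement sentences for this week's practice problem (counting events) might turn into an API. The sentences, the `ViewCounter` class, and its method names are my own assumptions, and the in-memory dictionary is only a stand-in for real storage.

```python
class ViewCounter:
    """In-memory stand-in for an event-counting service (hypothetical)."""

    def __init__(self):
        # video_id -> list of event timestamps in seconds
        self._events = {}

    # "The system counts a view event for a video."
    # Verb -> function name; nouns -> inputs.
    def count_view(self, video_id: str, timestamp: int) -> None:
        self._events.setdefault(video_id, []).append(timestamp)

    # "The system returns the number of views for a video in a time range."
    # Verb -> function name; nouns -> inputs and return value.
    def get_view_count(self, video_id: str, start: int, end: int) -> int:
        timestamps = self._events.get(video_id, [])
        return sum(1 for t in timestamps if start <= t < end)


counter = ViewCounter()
counter.count_view("video-a", 10)
counter.count_view("video-a", 20)
counter.count_view("video-b", 15)
print(counter.get_view_count("video-a", 0, 15))  # 1
```

Writing the signatures first makes the later steps easier to discuss: the non-functional requirements become questions about how fast and how often these two calls must succeed.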
When asking questions regarding a product spec for a large-scale system, focus on these 5 categories of questions:
- Users: Who are they? What do they do with the data?
- Scale: How many requests/sec? Reads or writes? Where is the bottleneck? How many users are we supporting? How often/fast do users need data?
- Performance: When do things need to be returned/confirmed? What are the tolerance and SLAs for constraints?
- Cost: Are we optimizing for development cost (use OSS) or operational/maintenance cost (use cloud services)?
- CAP theorem: Network partitions are something you know you’ll have to account for in any highly-scalable distributed system. So it may be easier to ask what is more valuable during a partition: consistency or availability?
If consistency is most important, consider an ACID database like Postgres or even a NewSQL database like CockroachDB or Google Spanner. If availability is most important, consider a BASE database, that is, an eventually consistent NoSQL solution such as CouchDB or Cassandra.
Even better is to use this diagram to map your concerns onto a pyramid. Given that you can only ever expect 2 of the 3 parts of the CAP theorem to be satisfied, it might actually be better to ask which property is least important. If it’s…
- Consistency - most NoSQL solutions, such as Cassandra, CouchDB, and Amazon Dynamo, will work
- Availability - some NoSQL and NewSQL solutions, such as Bigtable, HBase, MongoDB, Google Spanner, and Redis
- Partition tolerance - any relational or graph solution like Postgres or Neo4j will work since these are notoriously difficult to partition compared to the other solutions
That said, nearly everyone misunderstands the CAP theorem, so I would read this a few times and internalize the example.
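To internalize the consistency vs. availability choice, I find a toy model helpful. The sketch below is a deliberately simplified two-replica store; the `TwoNodeStore` class and its "CP"/"AP" mode labels are my own illustration, not how any real database implements this (real systems typically use quorums, so the majority side of a partition can keep working).

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica register; mode is "CP" or "AP" (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when a and b cannot reach each other

    def write(self, replica, value):
        if not self.partitioned:
            # Healthy network: update both replicas before acknowledging.
            self.a.value = value
            self.b.value = value
            return True
        if self.mode == "CP":
            # Favor consistency: reject the write so replicas never diverge.
            return False
        # Favor availability: accept locally and let replicas diverge for now.
        replica.value = value
        return True


cp = TwoNodeStore("CP")
cp.write(cp.a, 1)
cp.partitioned = True
print(cp.write(cp.a, 2), cp.a.value, cp.b.value)  # False 1 1

ap = TwoNodeStore("AP")
ap.write(ap.a, 1)
ap.partitioned = True
print(ap.write(ap.a, 2), ap.a.value, ap.b.value)  # True 2 1
```

The CP store stays correct but turns the client away during the partition. The AP store keeps answering, but a read from replica `b` returns stale data until the two sides reconcile.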
The three system qualities in 1 line
This chapter can effectively be summarized in 3 sentences:
- Scalability determines if this system can grow with the growth of your product. The best technique for this is partitioning.
- Reliability determines if this system produces correct results (nearly) each and every time. The best techniques for this are replication and checkpointing.
- Maintainability determines if this system can evolve with your team and is easy to understand, write, and extend.
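Since partitioning and replication come up as the go-to techniques, here is a minimal sketch of both. The function names and the "next nodes in the ring" replica placement are assumptions chosen for brevity; production systems tend to use schemes such as consistent hashing so that adding a node moves less data.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    md5 is used only to spread keys evenly, not for security. Python's
    built-in hash() is avoided because it is randomized per process for strings.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replicas_for(partition: int, num_nodes: int, replication_factor: int) -> list:
    """Place copies of a partition on consecutive nodes (hypothetical scheme)."""
    return [(partition + i) % num_nodes for i in range(replication_factor)]


partition = partition_for("user:42", num_partitions=4)
nodes = replicas_for(partition, num_nodes=4, replication_factor=2)
print(partition, nodes)
```

Partitioning spreads the keys so each node handles a slice of the load (scalability), while keeping each slice on more than one node means a single failure does not lose data (reliability).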
Further reading and study
As I said before, this is a pretty simple chapter. I also watched this systems design walkthrough. This video extended these concepts and informed some of these notes. I like to accompany learnings with practice to seed new questions for our own interview process.
This article on YouTube’s architecture further reinforces the sample problem on the YouTube video (how meta). You can check your solution against the one that was really used by YouTube.
Finally, you can rifle through a bunch of these videos fairly quickly as each touches on a small subset of system design techniques.
Check in next week for a summary of Chapter 2 of the book: Data Models & Query Languages!