Data models and query languages

This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.

This chapter focuses on how you can organize and retrieve data from a database. It primarily focuses on the three primary data modeling paradigms:

  1. Relational
  2. Document
  3. Graph

Relational model

This is the incumbent. The relation is the table. The individual datum within the table is the tuple called the row. Properties of a table exist on the column. You know, like any other table you would see in the real world.

I grew up with these database systems. They are what I studied in college. My master’s thesis was on RDBMS. They have been the dominant force in data management for most of the last 40 years.

The most cited open-source examples of relational databases are MySQL and Postgres. They’re still useful and are the first choice for many frameworks like Ruby on Rails or Django. It’s just that they suffer from a few problems:

  • They cannot scale without considerable effort. They are much harder to replicate or partition than they should be.
  • Limited expressiveness. Querying and organizing relational schemas fits a specific mold. You have to be okay with that or else you’ll have to seek alternatives.

So what are the alternatives?

Document model and the rise of NoSQL

NoSQL (which people would now call “Not Only SQL”) describes any kind of database that doesn’t follow the relational model. One thing I want to highlight is the term polyglot persistence. This refers to the idea that you can have both relational and nonrelational databases working together. It’s not like you have to choose one paradigm over the other. In fact, it might be advantageous to use both for your application to serve different needs.

I think this is important to call out because NoSQL implies that you’re specifically not choosing SQL. It’s as though they can’t work together in an application but in fact, they can and do.

  • something like a resume could be a series of joins between a user and their job experiences OR you could nest those joins into one cohesive JSON model
  • just one thing to fetch
    • it actually looks like the thing you’re representing (a resume is a document) - also means many-to-one and many-to-many relationships don’t work so well
    • fast: requires no joins since it is self-contained (also means poor support for combining disjointed data)
    • highly flexible: no federated schema (also means no referential integrity)

How do you know the document model is right for you?

Are you working with documents like profiles, resumes, HTML pages? It’s great. If you can populate that whole document it’s very efficient. If you have a lot of nested relationships or links to other documents it starts to become cumbersome.

A list of jobs might be two separate tables in a relational model. In the document model, they will all be listed out. The exception is if you have created a separate document for all job types. At that point, you’re heading in the wrong direction.

One-to-many, tree-like data structure? Good. Many-to-many, graph-like data structure? Bad - stick with a relational or graph database.

So when does something look like a document in the real world? The two often-cited examples I see are:

  • CMS. The primary object of a blog engine is a blog post which is a document. The important thing is that it is unstructured. The content of the blog doesn’t need to fit a particular format.
  • E-commerce product catalog. Imagine a service like Amazon. They have tons of products. Other than the SKU there is really no mandatory format for structuring how you display a product. This falls into the large volumes of unstructured data category which could be a good fit for a document model database.

Mongo is to Python as Postgres is to Java

This analogy just kept swimming in my brain. One of the key benefits (and drawbacks) of a document database is its loose constraints on schemas. The book makes an interesting analogy in that it’s like runtime compilation: no schemas are validated on writes on the way into the database but you can validate a schema on read. This means you can only be assured of the structure of the document when it is being read from the database.

SQL is to CSS as JSON is to the DOM

In a similar but not nearly as clean analogy, we can further think of the paradigms of database programming languages. Relational databases use declarative SQL like CSS. They optimize around patterns rather than algorithms because you just declare how you want to filter out your data and you let the language choose the algorithm. This allows for terse, parallelizable code where the language can handle performance optimization and you focus on just getting your data.

Since NoSQL databases don’t necessarily use SQL their method of storage and retrieval revolves around their documents, commonly stored in formats like XML and JSON. In theory, you can use any old imperative programming language like JavaScript to interact with the database. This is nice in theory since you are programming UIs and business logic in the same programming paradigm. The downside is that your code is likely more verbose, uglier, and requires a specific algorithmic sequence to execute. Think of it like styling a page with JavaScript; you could - but why would you do that to yourself?

Writing is not my strength…

Reads are the name of the game here. Small, unmodified documents are best. Anything else you risk heavy performance penalties. Documents get rewritten as a whole so minor updates to big documents are a huge drain on performance. The benefit of having all of the data shoved into one document is to remove the need for complicated things like joins. The drawback is without joins you have to colocate data. This can become expensive if the document becomes expensive.

…and it may not matter

Postgres supports JSON columns. Mongo supports foreign references in a weird sort of join operation. The truth is, the document and relational models may not be all that different soon so you aren’t locked into one idiom between the major database providers anymore. The book goes into an example of this with the MapReduce extension on MongoDB. I didn’t think the example was worth stating here but the last sentence on this topic is a great summary of what I took away from this section:

The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise

Graph models

Remember how earlier I said one-to-many relationships were a good fit for document databases and many-to-many were good for graph or relational databases? Now let’s pick that apart further.

If your many-to-many relationship is simple, you can probably get away with a relational database. For complex many-to-many relationships, a graph database is a good choice.

What’s an example of a complex many-to-many relationship? You already know canonical examples:

  • Social networks. People know lots of other people. Just ask Kevin Bacon.
  • The web. Web crawlers are designed to demystify the tangled mess of connecting and indexing every website.
  • Transportation. Roads, railroads, and airports all connect points across the globe. This is a hard problem and one I’ve explored before.

To be honest, I found this section to be a bit dense on specific kinds of graph databases that may not be relevant today so I’ll gloss over each section with the critical parts I took away.

Property graphs and Cypher

Neo4j is the example I see all over the web on graph databases. I think if you just knew this one and Amazon Neptune (which seems to support all kinds of graph database models) you’d be fine.

This is how I would imagine you would design a graph data structure at scale. There are two main objects in a property graph DB: the vertices and the edges. Like a database, each of those things has a UUID. They also have information as to where they go and who uses them. And as the property name suggests, each of these objects also has metadata in the form of key-value pairs.

The power here is in the flexibility to build graphs and the tagging you can store to find your vertices and edges. Property graphs don’t restrict how you construct your graph. Graphs don’t require a schema other than the basic vertex and edge. So you’re free to design it however you like. The power of the model is in how you relate things.

These key-value tags make it easy to find information in the graph without ruining your data model. It also means you don’t have a traverse a graph to find information. Tags and IDs allow you to quickly lookup vertices or edges without having to go through all of the graph connections. They’re like the shortcuts to traveling across the network.

If you need to evolve your features it doesn’t ruin the data. The key-value pairs allow you to create new associations between your core objects without your graph fundamentally having to change. Sounds like Facebook would be a good example of this: people are your edges and their connections are the vertices. In the old days of Facebook, you could list out your favorite bands or movies. A property graph database would allow you to add something like TV shows without having to totally screw up your existing graph or its associations.

Cypher is to property graphs as SQL is to relations. That’s all you need to know here. It’s a declarative programming language for Neo4j in the same way SQL is a programming language for databases like MySQL, Postgres, MariaDB, and SQLite.

Triple-store graphs and SPARQL

This is another graph database variant. Instead of vertices and edges as the primary objects a triple-store graph uses, you guessed it, three objects: subjects, predicates, and objects. This one intuitively makes less sense to me than a property graph because I don’t think of a graph as I think of an English sentence.

Here is how you make sense of these three things:

  • Subjects are always vertices. They are nouns and stuff like people and fruit.
  • Objects and predicates have 1 of 2 states.
    1. The object is another vertex. Then the predicate is the edge connecting subject and object together.
    2. The object is a primitive type. Then the predicate is a property where the object is a data type of the subject.

The book lists Datomic as an example of a triple-store graph but it doesn’t market itself as a graph database but an amalgamation of many flexible data models. For practical purposes of this audience, you’re likely not to run into a scenario like this.

SPARQL is to triple-store graphs as Cypher is to property graphs. SPARQL is the programming language for triple-store graph databases using the RDF data model for the semantic web. Given that the Semantic Web never really took off, it’s more of a reason why you don’t need to invest more than a few paragraphs into understanding these things.

Further reading and study

Since this was a much meatier chapter, I found I had some additional notes from the previous walk-through that was relevant here. There is another video in that same series I used for practice this week.

I found myself referencing a few articles which may be helpful in understanding NoSQL databases:

  • https://shopify.engineering/five-common-data-stores-usage - Shopify did a nice job of simply explaining the use cases for five different kinds of data stores. The segmentation is a bit odd here and it doesn’t align with what I’ve read in the book but it’s still factual information even if the categorization is faulty.
  • https://www.prisma.io/dataguide/intro/comparing-database-types#relational-databases-working-with-tables-as-a-standard-solution-to-organize-well-structured-data - This Prisma article was great because it expanded my knowledge even further beyond NoSQL databases to NewSQL, multi-model, time series, and more.
  • https://blog.nahurst.com/visual-guide-to-nosql-systems - I like that this post has a diagram that revolves around the CAP theorem. It helped me place where NewSQL fits and how to actually tell the difference between HBase and Cassandra.
  • https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - Another mega list of NoSQL solutions which categorizes based on popularity rather than storage type. I actually really liked this because it simplified my thought process to a pretty rock solid list:
  1. Want relational? Use Postgres.
  2. Want an in-memory key-value store? Use Redis.
  3. Want relational without the schema? Use MongoDB.
  4. Want web analytics? Use Cassandra.
  5. Want a full-text search? Use ElasticSearch.
  6. Working with documents, specifically a CMS or CRM? Use CouchDB.
  7. Building a search engine or log analysis? Use HBase.
  8. Working with graphs? Use Neo4J or OrientDB.

System design tradeoffs between SQL and NoSQL solutions

A few questions you have to ask yourself when comparing SQL and NoSQL solutions as they relate to scalability:

How do you scale reads? Writes?

For SQL solutions, you’ll want to shard servers when the database runs out of space. Use a cluster proxy to abstract away the shards. Use a configuration service like ZooKeeper to ensure all shards are online and be sure to toggle backups in the event of a downed shard.

For NoSQL solutions, you’ll also want to perform sharing. However, there is no need for a cluster proxy because each shard knows about the other. Thus, no configuration service is needed.

How do you make reads/writes fast?

For SQL solutions, you’ll want to use a shard proxy to cache results and publish metrics, and terminate long-running processes.

For NoSQL solutions, no shard proxies are required because all nodes know about each other. Quorum nodes are used to ensure you don’t need to talk to every node to get a consensus on whether or not the operations succeed.

How do you prevent data loss?

For both SQL and NoSQL solutions, you’ll want to create read replicas of shards in different data centers.

How do you achieve consistency?

For SQL solutions, you’ll want to sync lead data to follower replica data. This leads to consistent data. It also takes a long time to process.

For NoSQL solutions, data is pushed to the read replicas asynchronously. This provides eventual consistency and is fast.

Other questions worth asking

  • How do you recover data in case of an outage? Data recovery is much harder for a relational system that sacrifices partition tolerance so you’ll need to focus more on cold storage backups while NoSQL solutions are much easier to replicate their replicas across centers.
  • How do you ensure the data is secure? I only saw mentions of Accumulo and Riak specializing in security. Otherwise, every database needs to worry about this regardless of storage type.
  • How extensible is the DB to an evolving data model? Evolving data models are better suited for a graph database than a document database.
  • How easy is it to run on/off-prem? This question was asked in the systems design video from last week but I don’t think most companies really need to worry about on or off-premises solutions anymore.
  • How much will it cost? Most solutions these days are open-source. The cost will come down to your strategy for your horizontal scaling and partitioning.

Get the FREE UI crash course

Sign up for our newsletter and receive a free UI crash course to help you build beautiful applications without needing a design background. Just enter your email below and you'll get a download link instantly.

A new version of this app is available. Click here to update.