Data models and query languages

This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.

This chapter focuses on how you can organize and retrieve data from a database. It primarily focuses on the three primary data modeling paradigms:

Relational
Document
Graph

Relational model

This is the incumbent. The relation is the table. The individual datum within the table is the tuple called the row. Properties of a table exist on the column. You know, like any other table you would see in the real world.

I grew up with these database systems. They are what I studied in college. My master’s thesis was on RDBMS. They have been the dominant force in data management for most of the last 40 years.

The most cited open-source examples of relational databases are MySQL and Postgres. They’re still useful and are the first choice for many frameworks like Ruby on Rails or Django. It’s just that they suffer from a few problems:

They cannot scale without considerable effort. They are much harder to replicate or partition than they should be.
Limited expressiveness. Querying and organizing relational schemas fits a specific mold. You have to be okay with that or else you’ll have to seek alternatives.

So what are the alternatives?

Document model and the rise of NoSQL

NoSQL (which people would now call “Not Only SQL”) describes any kind of database that doesn’t follow the relational model. One thing I want to highlight is the term polyglot persistence. This refers to the idea that you can have both relational and nonrelational databases working together. It’s not like you have to choose one paradigm over the other. In fact, it might be advantageous to use both for your application to serve different needs.

I think this is important to call out because NoSQL implies that you’re specifically not choosing SQL. It’s as though they can’t work together in an application but in fact, they can and do.

something like a resume could be a series of joins between a user and their job experiences OR you could nest those joins into one cohesive JSON model
just one thing to fetch
- it actually looks like the thing you’re representing (a resume is a document) - also means many-to-one and many-to-many relationships don’t work so well
- fast: requires no joins since it is self-contained (also means poor support for combining disjointed data)
- highly flexible: no federated schema (also means no referential integrity)

How do you know the document model is right for you?

Are you working with documents like profiles, resumes, HTML pages? It’s great. If you can populate that whole document it’s very efficient. If you have a lot of nested relationships or links to other documents it starts to become cumbersome.

A list of jobs might be two separate tables in a relational model. In the document model, they will all be listed out. The exception is if you have created a separate document for all job types. At that point, you’re heading in the wrong direction.

One-to-many, tree-like data structure? Good. Many-to-many, graph-like data structure? Bad - stick with a relational or graph database.

So when does something look like a document in the real world? The two often-cited examples I see are:

CMS. The primary object of a blog engine is a blog post which is a document. The important thing is that it is unstructured. The content of the blog doesn’t need to fit a particular format.
E-commerce product catalog. Imagine a service like Amazon. They have tons of products. Other than the SKU there is really no mandatory format for structuring how you display a product. This falls into the large volumes of unstructured data category which could be a good fit for a document model database.

Mongo is to Python as Postgres is to Java

This analogy just kept swimming in my brain. One of the key benefits (and drawbacks) of a document database is its loose constraints on schemas. The book makes an interesting analogy in that it’s like runtime compilation: no schemas are validated on writes on the way into the database but you can validate a schema on read. This means you can only be assured of the structure of the document when it is being read from the database.

SQL is to CSS as JSON is to the DOM

In a similar but not nearly as clean analogy, we can further think of the paradigms of database programming languages. Relational databases use declarative SQL like CSS. They optimize around patterns rather than algorithms because you just declare how you want to filter out your data and you let the language choose the algorithm. This allows for terse, parallelizable code where the language can handle performance optimization and you focus on just getting your data.

Since NoSQL databases don’t necessarily use SQL their method of storage and retrieval revolves around their documents, commonly stored in formats like XML and JSON. In theory, you can use any old imperative programming language like JavaScript to interact with the database. This is nice in theory since you are programming UIs and business logic in the same programming paradigm. The downside is that your code is likely more verbose, uglier, and requires a specific algorithmic sequence to execute. Think of it like styling a page with JavaScript; you could - but why would you do that to yourself?

Writing is not my strength…

Reads are the name of the game here. Small, unmodified documents are best. Anything else you risk heavy performance penalties. Documents get rewritten as a whole so minor updates to big documents are a huge drain on performance. The benefit of having all of the data shoved into one document is to remove the need for complicated things like joins. The drawback is without joins you have to colocate data. This can become expensive if the document becomes expensive.

…and it may not matter

Postgres supports JSON columns. Mongo supports foreign references in a weird sort of join operation. The truth is, the document and relational models may not be all that different soon so you aren’t locked into one idiom between the major database providers anymore. The book goes into an example of this with the MapReduce extension on MongoDB. I didn’t think the example was worth stating here but the last sentence on this topic is a great summary of what I took away from this section:

The moral of the story is that a NoSQL system may find itself accidentally reinventing SQL, albeit in disguise

Graph models

Remember how earlier I said one-to-many relationships were a good fit for document databases and many-to-many were good for graph or relational databases? Now let’s pick that apart further.

If your many-to-many relationship is simple, you can probably get away with a relational database. For complex many-to-many relationships, a graph database is a good choice.

What’s an example of a complex many-to-many relationship? You already know canonical examples:

Social networks. People know lots of other people. Just ask Kevin Bacon.
The web. Web crawlers are designed to demystify the tangled mess of connecting and indexing every website.
Transportation. Roads, railroads, and airports all connect points across the globe. This is a hard problem and one I’ve explored before.

To be honest, I found this section to be a bit dense on specific kinds of graph databases that may not be relevant today so I’ll gloss over each section with the critical parts I took away.

Property graphs and Cypher

Neo4j is the example I see all over the web on graph databases. I think if you just knew this one and Amazon Neptune (which seems to support all kinds of graph database models) you’d be fine.

This is how I would imagine you would design a graph data structure at scale. There are two main objects in a property graph DB: the vertices and the edges. Like a database, each of those things has a UUID. They also have information as to where they go and who uses them. And as the property name suggests, each of these objects also has metadata in the form of key-value pairs.

The power here is in the flexibility to build graphs and the tagging you can store to find your vertices and edges. Property graphs don’t restrict how you construct your graph. Graphs don’t require a schema other than the basic vertex and edge. So you’re free to design it however you like. The power of the model is in how you relate things.

These key-value tags make it easy to find information in the graph without ruining your data model. It also means you don’t have a traverse a graph to find information. Tags and IDs allow you to quickly lookup vertices or edges without having to go through all of the graph connections. They’re like the shortcuts to traveling across the network.

If you need to evolve your features it doesn’t ruin the data. The key-value pairs allow you to create new associations between your core objects without your graph fundamentally having to change. Sounds like Facebook would be a good example of this: people are your edges and their connections are the vertices. In the old days of Facebook, you could list out your favorite bands or movies. A property graph database would allow you to add something like TV shows without having to totally screw up your existing graph or its associations.

Cypher is to property graphs as SQL is to relations. That’s all you need to know here. It’s a declarative programming language for Neo4j in the same way SQL is a programming language for databases like MySQL, Postgres, MariaDB, and SQLite.

Triple-store graphs and SPARQL

This is another graph database variant. Instead of vertices and edges as the primary objects a triple-store graph uses, you guessed it, three objects: subjects, predicates, and objects. This one intuitively makes less sense to me than a property graph because I don’t think of a graph as I think of an English sentence.

Here is how you make sense of these three things:

Subjects are always vertices. They are nouns and stuff like people and fruit.
Objects and predicates have 1 of 2 states.
1. The object is another vertex. Then the predicate is the edge connecting subject and object together.
2. The object is a primitive type. Then the predicate is a property where the object is a data type of the subject.

The book lists Datomic as an example of a triple-store graph but it doesn’t market itself as a graph database but an amalgamation of many flexible data models. For practical purposes of this audience, you’re likely not to run into a scenario like this.

SPARQL is to triple-store graphs as Cypher is to property graphs. SPARQL is the programming language for triple-store graph databases using the RDF data model for the semantic web. Given that the Semantic Web never really took off, it’s more of a reason why you don’t need to invest more than a few paragraphs into understanding these things.

Data models and query languages

Relational model

Document model and the rise of NoSQL

How do you know the document model is right for you?

Mongo is to Python as Postgres is to Java

SQL is to CSS as JSON is to the DOM

Writing is not my strength…

…and it may not matter

Graph models

Property graphs and Cypher

Triple-store graphs and SPARQL

Further reading and study

System design tradeoffs between SQL and NoSQL solutions

How do you scale reads? Writes?

How do you make reads/writes fast?

How do you prevent data loss?

How do you achieve consistency?

Other questions worth asking

Get the FREE UI crash course

Relational model

Document model and the rise of NoSQL

How do you know the document model is right for you?

Mongo is to Python as Postgres is to Java

SQL is to CSS as JSON is to the DOM

Writing is not my strength…

…and it may not matter

Graph models

Property graphs and Cypher

Triple-store graphs and SPARQL

Further reading and study

System design tradeoffs between SQL and NoSQL solutions

How do you scale reads? Writes?

How do you make reads/writes fast?

How do you prevent data loss?

How do you achieve consistency?

Other questions worth asking

Liked this post? You'll probably also like:

Get the FREE UI crash course