Databases and State

Every product decision about speed, consistency, and reliability is actually a decision about how you store and query data.

13 min read

Your database is where your application keeps everything it needs to remember between requests. User accounts, transaction history, preferences, content — all of it lives in a database. The database you choose determines how fast you can query data, how much data you can store, whether you can guarantee consistency, and how much it costs to operate. These constraints shape what your product can do.

Relational databases like PostgreSQL and MySQL organize data into tables with predefined schemas. Each row is a record, each column is a field, and relationships between tables are enforced with foreign keys. The strength of relational databases is structure and consistency — they guarantee that your data follows rules and that transactions are atomic. The weakness is rigidity: changing your schema as your product evolves requires careful migrations.

NoSQL databases like MongoDB, DynamoDB, and Cassandra trade structure for flexibility. Instead of tables and rows, you store documents or key-value pairs. You can add fields to documents without touching every other record. The schema is whatever you decide to write. The cost is that you lose guarantees: NoSQL databases prioritize availability and partition tolerance over strict consistency, which means two users might see different versions of the same data for a brief window.

The CAP theorem states that in a distributed database, you can have at most two of three properties: consistency, availability, and partition tolerance. Partition tolerance is non-negotiable in any distributed system — network failures happen. That leaves you choosing between consistency (everyone sees the same data immediately) and availability (the system always responds, even if the data is slightly stale). Relational databases choose consistency. Most NoSQL databases choose availability.

Read-heavy and write-heavy workloads require different optimizations. A news site that serves millions of page views per hour but publishes articles only a few times per day is read-heavy — you can cache aggressively and serve stale data. A stock trading platform that processes thousands of orders per second is write-heavy — you need fast writes and strong consistency. If your engineers cannot tell you whether your workload is read-heavy or write-heavy, you are optimizing blind.

Database indexes are how you make queries fast. An index is a data structure that lets the database find rows without scanning the entire table. Without an index, querying a million-row table for a specific user ID means checking all million rows. With an index, it is logarithmic. The cost is storage and slower writes — every index must be updated when data changes. Over-indexing slows down your database. Under-indexing makes it unusable.

Vertical scaling means adding more CPU, RAM, or disk to a single database server. Horizontal scaling means splitting data across multiple servers. Relational databases scale vertically well but struggle to scale horizontally because maintaining consistency across distributed nodes is hard. NoSQL databases are designed for horizontal scaling but sacrifice consistency to achieve it. Your choice depends on whether you expect to hit the limits of a single server and how much you care about strict consistency.

Database migrations are how you change your schema in production without breaking your application. Adding a column is straightforward. Renaming a column requires deploying code that handles both the old and new names simultaneously, then migrating data, then removing support for the old name. Dropping a column requires ensuring no code references it first. Migrations gone wrong cause downtime. If your engineers are not testing migrations in staging before running them in production, you will have an outage.

Choose relational when your data has structure and rules. Choose NoSQL when your data model is still evolving or your scale requires it.

This lesson is coming soon.

TERMS

Term of focus

Relational database (SQL)

A database that organizes data into tables with predefined columns and enforces relationships between tables. SQL (Structured Query Language) is the standard language for querying these databases. Relational databases guarantee consistency and enforce constraints, making them ideal for financial transactions, user accounts, and any domain where data integrity is critical.

A broad category of databases that do not use fixed table schemas — includes document stores (MongoDB), key-value stores (Redis, DynamoDB), wide-column stores (Cassandra), and graph databases (Neo4j). NoSQL databases prioritize flexibility, horizontal scalability, and availability over strict consistency. Use them when your data model changes frequently or when you need to scale beyond what a single relational database can handle.

The structure that defines how data is organized — what tables exist, what columns each table has, what data types are allowed, and what relationships exist between tables. In relational databases, the schema is enforced by the database. In NoSQL, the schema is enforced by your application code (or not enforced at all). Changing a schema in production requires careful planning.

Atomicity, Consistency, Isolation, Durability — the properties that guarantee database transactions are reliable. Atomicity means a transaction either completes fully or not at all (no partial writes). Consistency means the database enforces all rules. Isolation means concurrent transactions do not interfere with each other. Durability means committed data survives crashes. Relational databases provide ACID guarantees. Most NoSQL databases do not.

A fundamental constraint in distributed systems: you can have at most two of Consistency (all nodes see the same data), Availability (system always responds), and Partition tolerance (system works despite network failures). Since network failures are unavoidable, you must choose between consistency and availability. Understanding this trade-off is essential when evaluating database options.

A data structure that speeds up queries by allowing the database to find rows without scanning the entire table. An index on the email column lets you find a user by email in milliseconds instead of seconds. Every index consumes storage and slows down writes. The art of database performance is indexing the right columns without over-indexing.

Splitting a database horizontally across multiple servers by dividing data along a shard key (e.g., user ID ranges, geographic regions). Sharding is how you scale writes beyond what one server can handle, but it introduces complexity: queries that span shards are slow, and rebalancing shards as data grows is operationally difficult.

BEFORE YOUR NEXT MEETING

— Do we know whether our workload is read-heavy or write-heavy, and have we optimized our database configuration for that pattern?

— If our database died right now, how much data would we lose — minutes, hours, or none — and is that acceptable for our business?

— Can you explain what indexes exist on our most queried tables and whether we are missing any critical ones?

— What happens to our application if the database becomes unavailable for 30 seconds — does it degrade gracefully or does everything break?

— How do we test database migrations in staging, and what is our rollback plan if a migration breaks production?

REALITY CHECK

SOURCES

↗Martin Kleppmann — "Designing Data-Intensive Applications" (2017)

↗Eric Brewer — "CAP Twelve Years Later: How the Rules Have Changed" (2012)

↗Use The Index, Luke — "SQL Indexing and Tuning Guide"

↗AWS — "Amazon DynamoDB Deep Dive" (2023)

↗Charity Majors — "Operational Best Practices for Databases" (2020)

↗Postgres Documentation — "Performance Tips"

LESSON 02 OF 04