system designintermediate 12m2026-06-09

Sharding & Replication — Partition Keys, Hot Spots, Multi-Region

Session 18 of the 48-session learning series.

Why this session matters

This is Session 18 of 48 in the System Design track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

Vertical vs horizontal partitioning — when each makes sense
Sharding strategies — range, hash, consistent hashing, directory
Hot keys and hot shards — detection and mitigation
Replication topologies — leader-follower, multi-leader, leaderless
Multi-region — active/passive, active/active, conflict resolution

Pre-read (skim before the session)

When data outgrows one machine, you have three choices: bigger machine (vertical), more machines (horizontal), or less data (deletion / archival). Vertical scaling hits limits — fastest single-server SSD writes top out at ~3 GB/s, RAM at TBs but $$$ exponential.

Horizontal partitioning (= sharding) splits the data set across nodes. Each shard holds a subset; routing layer figures out which.

2. Range partitioning

Sort by key (timestamp, user_id), assign contiguous ranges to shards.

shard 1: user_id 0       .. 100_000
shard 2: user_id 100_001 .. 200_000
shard 3: user_id 200_001 .. 300_000

Pros: range scans are local. Great for time-series (logs, metrics).

Cons: hot spots if writes cluster in one range (recent timestamps → shard with now()). Mitigate by bucketing time + hash prefix.

Used by: HBase, Bigtable, MongoDB (default), CockroachDB.

3. Hash partitioning

Hash the key, mod by shard count.

shard = hash(user_id) % N

Pros: uniform distribution → no hot shards on uniform keys.

Cons: range scans now touch every shard. Resharding (changing N) reshuffles all data.

Used by: Cassandra (token rings), Redis Cluster.

4. Consistent hashing — the fix to "all data reshuffled"

Map both keys and nodes onto a ring of size 2^64. Each key belongs to the next node clockwise on the ring. Add a node → only the keys between it and its predecessor move. Remove a node → its keys go to its successor.

       N1
       │
N4 ────●──── N2
       │
       N3

key 0xABCD lands between N3 and N4 → owned by N4.

Add N5 between N3 and N4 → only some of N4's keys move to N5.

Virtual nodes — give each physical node 100–1000 virtual positions on the ring. Smooths load when nodes are heterogeneous and limits reshuffling on add/remove.

Used by: Cassandra, Dynamo, Riak, memcached client libs.

5. Directory-based — a lookup table

Maintain a service that maps key → shard. Total flexibility, can rebalance arbitrarily. Cost: extra hop, the directory itself must be HA.

Used by: Vitess (MySQL sharding), Citus (Postgres), HDFS NameNode.

6. Picking a partition key

The single most consequential schema decision. Optimise for:

Even load — avoid keys with skewed distribution (country, tenant_id with one huge tenant).
Locality of access — queries should touch one or few shards. Chat app: shard by chat_id, not user_id.
Cardinality — must have enough distinct values to spread.

Composite keys (e.g., (tenant_id, user_id) with hash on tenant_id, range on user_id) are common in multi-tenant systems.

7. Hot keys

The Justin Bieber problem — one celebrity follower count update triggers writes to one shard at 100K QPS.

Mitigations:

Salting: write to key:0, key:1, …, key:99; reads aggregate. Trades read amplification for write distribution.
Read-side caching: hot keys live in CDN/Redis; the DB sees << QPS.
Write coalescing: batch many "+1" into a single "+N" every 100 ms.
Dedicated shard: pin known hot keys to a dedicated, larger shard.

8. Hot shards (vs hot keys)

A hot shard gets disproportionate traffic even with no single hot key — e.g., partitioned by region and one region is 10× others. Mitigate by:

Re-partition with a more uniform key.
Subshard the hot shard.
Use a routing layer that biases reads to replicas.

9. Replication topologies

Leader-follower (single-leader, master/slave):

One node accepts writes; others copy.
Failover requires election or manual promotion.
Used by: Postgres, MySQL, MongoDB replica set, Kafka.

Multi-leader:

Two+ leaders, each accepts writes; they replicate to each other.
Write conflict resolution is the hard problem (LWW, CRDT, application code).
Used by: rare in OLTP; common in cross-region BDR setups.

Leaderless:

Any node accepts writes; clients write to W replicas, read from R, ensure W + R > N.
Read repair + anti-entropy reconcile divergence.
Used by: Dynamo, Cassandra, Riak.

10. Multi-region

Active/passive (DR site) — writes go to one region; passive replicates async. RPO = replication lag (seconds), RTO = failover time (minutes). Cheap, simple, lossy.

Active/active (multi-region writes) — writes accepted everywhere. Three flavours:

Geo-partitioned — each region owns a subset of keys (EU rows in EU, US rows in US). No conflicts. Latency local for owned keys, cross-region for others. CockroachDB, Spanner, Cosmos DB.
CRDT-based — eventual consistency with deterministic merge. Counter, set, JSON. Riak, Yjs, Automerge.
Last-Write-Wins on timestamps — simple, lossy on concurrent writes. Cassandra default.

11. Spanner / CockroachDB — the cheating answer

Synchronised clocks (TrueTime) or hybrid logical clocks let these systems give you strong consistency across regions at ~10–50 ms write latency. They use Paxos/Raft groups per partition, replicated across regions. Most fintech and any greenfield CP-needing system now starts here.

12. Operational realities

Resharding is the worst day of the year. Plan capacity early; double when you cross 70%.
Backfills — adding a column or migrating shards while writes continue. Always idempotent, always restartable.
Read-your-writes — even on a single region, reading from a replica can return stale data. Use leader reads or "session consistency" tokens.
Schema changes — Postgres-style online DDL has limits; in sharded systems, schema migrations need to be backwards/forwards compatible.

13. Sizing rule of thumb

Total disk / shard ≤ 60% of node disk          (room for compactions, backups)
RPS per shard ≤ 70% of saturation              (room for failures, hot spots)
Replication factor = 3                          (industry default)
Cross-region replicas = 1 per region            (active/passive minimum)

Reading material

Books:

Designing Data-Intensive Applications — Martin Kleppmann (chs. 5–7: replication, partitioning, transactions — the canonical reference)
Database Internals — Alex Petrov (Part II on distributed systems, leader election, replication logs)
Building Microservices, 2nd ed. — Sam Newman (the chapter on data + ownership across regions)

Papers:

Dynamo: Amazon's Highly Available Key-value Store — DeCandia et al. 2007 — the consistent-hashing, quorum, vector-clock origin story.
Spanner: Google's Globally Distributed Database — Corbett et al. 2012 — TrueTime + externally consistent multi-region.
Consistent Hashing and Random Trees — Karger et al. 1997 — the paper that named consistent hashing.
Slicer: Auto-Sharding for Datacenter Applications — Adya et al. 2016 (Google) — Google's general-purpose sharding service.

Official docs:

CockroachDB Architecture — ranges, leaseholders, raft groups.
Vitess — Sharding — the YouTube/Slack MySQL sharding stack.
DynamoDB — Partitioning and Distribution — hash + range partitions, hot partition mitigation.
MongoDB Sharding — the chunk + balancer model.

Blog posts:

How Discord Stores Trillions of Messages — Discord Engineering — sharding Cassandra/ScyllaDB by channel.
Sharding & IDs at Instagram — Instagram Engineering — the Snowflake-style ID generator and Postgres sharding.
Replication in Distributed Systems — Martin Kleppmann (blog post version of DDIA ch. 5)

In-depth research material

Vitess — github.com/vitessio/vitess — ~18k ★, MySQL sharding used at YouTube, Slack, Square.
TiDB — github.com/pingcap/tidb — ~37k ★, MySQL-compatible distributed SQL with auto-sharding.
CockroachDB — github.com/cockroachdb/cockroach — ~29k ★, Spanner-style multi-region SQL.
Discord — How Discord Stores Billions of Messages — the earlier Cassandra-era post; pairs with the trillions follow-up.
Figma — How we built a 50x faster Postgres — horizontal sharding Postgres without going Spanner.
Notion — The journey to sharding — Postgres sharding for a 200B+ block dataset.
CRDT survey — Shapiro et al. 2011 — the convergent replicated data types canon for multi-master.
Aurora paper — Verbitski et al. 2017 — storage-replication done at the page level.
Jepsen — distributed systems safety analyses — every replication bug in production, dissected.
Spanner: Becoming a SQL System — Bacon et al. 2017 — follow-up paper on Spanner's SQL layer.

Videos

Consistent Hashing | Algorithms You Should Know — ByteByteGo · 7 min — the canonical animated explanation; watch first.
How do databases scale? Sharding, Replication — ByteByteGo — ByteByteGo · 14 min — strategies and trade-offs in plain English.
CMU 15-721 — Database Sharding & Partitioning — Andy Pavlo · 1 h 14 min — the deep academic treatment; partitioning schemes, query routing.
How Google Spanner Works — ByteByteGo / Spanner authors — 16 min — TrueTime + global consistency walkthrough.
Vitess: How we scale Slack's MySQL fleet — Slack/Vitess · 31 min — a real shop's sharding playbook.

LeetCode — Consistent Hashing Design

Link: https://leetcode.com/problems/consistent-hashing-design/
Difficulty: Medium
Why this problem: Treat as design — a TreeMap of hash→node; lookup is upper-bound.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick a partition strategy for a chat app, a time-series store, a multi-tenant SaaS.
Diagram consistent hashing with virtual nodes; explain what happens on node add.
List 3 mitigations for a hot key and 2 for a hot shard.
Compare leader-follower, multi-leader, leaderless on conflict resolution.
Explain why Spanner can be strongly consistent across regions.
Solve consistent-hashing-design — TreeMap of hash → node; upper-bound lookup.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Embeddings, Vector Spaces, Contrastive Learning

Lakehouse — Delta Lake, Iceberg, Hudi, ACID on Object Storage