data engineeringintermediate 12m2026-06-09

Data Modelling — Dimensional, Data Vault, OBT for the Lakehouse

Session 28 of the 48-session learning series.

Why this session matters

This is Session 28 of 48 in the Data Engineering track. The schema you pick will outlive your job — fact tables get queried for years. Choosing between dimensional, data vault, and one-big-table is one of the highest-leverage decisions a DE makes. Get it wrong and every downstream team pays.

Agenda

Kimball dimensional modelling — facts, dimensions, star/snowflake
Inmon vs Kimball vs Data Vault — when each wins
Data Vault 2.0 — hubs, links, satellites, the audit story
One Big Table (OBT) — when denormalisation beats joins
Modern lakehouse hybrid — Bronze/Silver/Gold + dbt

Pre-read (skim before the session)

Deep dive

1. Why modelling matters

A bad model causes:

Queries that scan TBs because they can't push down filters.
Reports that disagree because the definition lives in 5 places.
New analysts who can't find anything for weeks.
Cost overruns because every join shuffles.

A good model is queryable (fewer tables, intuitive joins), correct (one source of truth per concept), and evolvable (you can change without rewriting downstream).

2. Kimball dimensional (star schema)

The classic. Big fact table at centre, small dimension tables around it.

              [ dim_customer ]
                     │
[ dim_product ] ─ [ fact_orders ] ─ [ dim_date ]
                     │
              [ dim_store ]

Fact — measurements (revenue, qty, click). One row per event. Foreign keys to dimensions. Mostly numeric.
Dimension — descriptive context (customer attributes, product hierarchy, date breakdown). Joined to filter/group facts.

Star (one dimension table per axis) vs snowflake (dimensions normalised). Default to star — joins are cheaper in MPP/columnar engines than the storage savings of snowflake.

3. Slowly Changing Dimensions (SCD)

Customer attributes change. How do you preserve history?

Type	Behaviour	When
SCD 0	Never change	Birth date
SCD 1	Overwrite	Fixing typos
SCD 2	New row + valid_from/valid_to	Need historical accuracy (most analytical needs)
SCD 3	Add `previous_value` column	Comparing then-vs-now once
SCD 4	History in separate table	Massive dimensions
SCD 6	Hybrid 1+2+3	Both current and historical

SCD 2 is the workhorse. Surrogate key (system-generated id) + business key + validity range.

4. Inmon vs Kimball

Inmon — top-down. Build a normalised 3NF enterprise data warehouse. Then build data marts (often dimensional) for departments.
Kimball — bottom-up. Build dimensional data marts directly, conformed via shared dimensions ("conformed bus").

Inmon: heavier upfront, more rigorous, slower. Kimball: faster delivery, occasional schema drift across marts.

Modern reality: most teams are Kimball-ish. The lakehouse Bronze layer is kind of Inmon (3NF-ish raw). Gold is dimensional.

5. Data Vault 2.0

Built for audit-heavy, regulated environments (banks, healthcare, government). Three artefact types:

Hub — business keys + load metadata. hub_customer(customer_bk, load_dts, source).
Link — associative; relates hubs. link_customer_order(customer_hk, order_hk, ...).
Satellite — descriptive attributes + history. sat_customer_demo(customer_hk, load_dts, name, city, ...).

Pros:

Insert-only — no updates → no destructive history loss.
Source-traceable — every record carries source + timestamp.
Highly parallel loads.
Schema-evolvable — new attribute = new satellite, no impact on existing.

Cons:

Verbose; 3–4x more tables than dimensional.
Queries need many joins → BI layer on top usually compiles down to dimensional.

Use when: regulator audits, many sources of same entity, frequent schema evolution. Skip for: small analytics shops, fast-moving startup.

6. One Big Table (OBT)

Single denormalised wide table. All attributes joined and flattened.

orders_wide(order_id, ts, customer_id, customer_name, customer_country,
            product_id, product_name, product_category, ...)

Pros:

Zero joins — fastest possible query.
Simplest mental model.
Plays nice with columnar pruning (no joins to break dim filters).

Cons:

Duplication — customer name in every row of every order.
Update nightmare — name change requires updating every row.
Storage cost.

Modern OBT thrives because:

Compression on columnar (Parquet) makes duplication cheap.
Compute is much more expensive than storage on cloud.
Many BI tools (Looker, Mode) flatten anyway.

Increasingly popular in lakehouse Gold layer. dbt + BigQuery / Snowflake users love it.

7. Bronze / Silver / Gold (lakehouse pattern)

Modern lakehouse layout:

Layer	Content	Modelling
Bronze	Raw landed data, immutable	Source schema; minimal transform
Silver	Cleaned, deduplicated, enriched	Normalised-ish; one row per business event
Gold	Business-ready, denormalised	Dimensional star OR OBT; per use-case

Decoupling layers lets you fix bugs in Gold without re-ingesting; rebuild Gold from Silver in hours.

8. dbt as the connective tissue

dbt = SQL + Jinja + DAG + tests. Models live in .sql files; dependencies inferred from {{ ref('upstream') }}.

Conventions that scale:

models/staging/ — Bronze → Silver, 1:1 with source.
models/intermediate/ — joins, mid-level transformations.
models/marts/ — Gold; per business area (finance, marketing, product).
Tests in schema.yml: not_null, unique, accepted_values, custom.
Snapshots for SCD 2.

9. Surrogate vs natural keys

Natural key (BK) — value with meaning (order_id, email). Can change, can be reused.
Surrogate key (SK) — system-generated id (UUID, hash, sequence). Stable.

Always carry both in dimensions. Facts reference SK (stable join), surface BK for human users.

10. Modelling time

Two date attributes per event:

Event time — when it actually happened (user clicked).
Processing time — when it landed in your warehouse.

Always log both. Watermarks (S23), late-arriving facts, replay correctness all depend on it.

dim_date table with pre-computed attributes (day_of_week, fiscal_quarter, is_holiday) saves countless joins later.

11. Anti-patterns

100-column dimension that mixes 3 entities → split.
Fact table with no FK to time → can't analyse trends.
Mart per dashboard → duplicate logic everywhere.
Materialised view as load mechanism → recomputation cost explodes.
Storing audit history in JSON columns → ungroupable.

12. Reality check

For a new project:

Bronze: raw landings, no transforms.
Silver: dedup, cast, enrich; one canonical event per concept.
Gold: dimensional star or OBT depending on consumers (BI = star; ML = OBT).
dbt for transforms; great_expectations / dbt tests for quality.
Catalog (DataHub, OpenMetadata) for discoverability.

You don't need Data Vault unless an auditor is asking. You don't need pure Inmon unless you have decades of legacy systems to harmonise.

Reading material

Books:

The Data Warehouse Toolkit, 3rd ed. — Ralph Kimball & Margy Ross (the dimensional-modelling bible; 30+ years and still the answer)
Building the Data Warehouse, 4th ed. — Bill Inmon (the third-normal-form/corporate-information-factory counterpoint)
Building a Scalable Data Warehouse with Data Vault 2.0 — Linstedt & Olschimke (the canonical Data Vault book)
The Data Warehouse Lifecycle Toolkit — Kimball Group (the project-management companion to the modelling book)
Fundamentals of Data Engineering — Joe Reis & Matt Housley (modern Lakehouse perspective on dim/vault/OBT)

Papers:

Functional Data Engineering — Maxime Beauchemin 2018 (Medium) — the post that defined the modern idempotent / pure batch model (essay, but cited everywhere).
Common Schemas for Data Warehouses — the Inmon vs Kimball debate (DAMA) — historical context for the trade-offs.
Beyond Star and Snowflake: A Survey of Modern OLAP Data Models — the academic landscape view.

Official docs:

dbt — How we structure our dbt projects (best practices) — the modern standard for staging / intermediate / marts.
dbt — Slowly Changing Dimensions (snapshots) — SCD2 the dbt way.
Databricks — Medallion architecture — Bronze/Silver/Gold; the Lakehouse name for staging/intermediate/marts.
Snowflake — Designing Tables — the cloud-DW take on physical design.
BigQuery — Schema design best practices — nested/repeated columns; the case for OBT.

Blog posts:

Locally Optimistic — modelling articles archive — the best practitioner blog for analytics engineering choices.
Modern Data Modelling for the Modern Data Stack — Tristan Handy (dbt Labs) — the manifesto from dbt's co-founder.
One Big Table — Coalesce article — the OBT argument and its trade-offs.

In-depth research material

dbt-core — github.com/dbt-labs/dbt-core — ~10k ★, the SQL-first transformation tool that won the modern data stack.
SQLMesh — github.com/TobikoData/sqlmesh — ~2k ★, the modern dbt alternative with semantic diffs + virtual envs.
Datavault4dbt — github.com/ScalefreeCOM/datavault4dbt — the canonical Data Vault implementation on dbt.
dbt-labs — jaffle-shop reference project — the canonical small dbt project; copy this structure.
Awesome Data Engineering — github.com/igorbarinov/awesome-data-engineering — the curated list.
Airbnb — The Anatomy of an Airbnb Data Model — dim modelling at Airbnb scale.
Lyft — Amundsen (data discovery) — the discovery layer on top of dim models.
GoCardless — The Modern Data Stack at GoCardless — a real end-to-end story.
Joe Reis — The Practical Data Modeler (Substack) — the modern reissuing of Kimball wisdom.
Snowflake — The Tasty Bytes Data Modelling Quickstart — hands-on dim modelling on Snowflake.

Videos

Kimball Dimensional Modelling Crash Course — Joe Reis — Joe Reis · 1 h 18 min — the canonical modern walk-through; the place to start.
Data Vault 2.0 in 60 minutes — Dan Linstedt — Dan Linstedt · 60 min — the inventor of Data Vault explaining it himself.
Dimensional Modeling with dbt — GitLab Data Team — 1 h — hands-on with the modern dbt patterns.
Medallion Architecture Explained — Databricks — 24 min — the Lakehouse spin on dim modelling.
Modern Data Modelling — Maxime Beauchemin — Maxime Beauchemin · 50 min — the Airflow/Superset creator on functional data engineering.

LeetCode — Relative Ranks

Link: https://leetcode.com/problems/relative-ranks/
Difficulty: Easy
Why this problem: Sort + assign positional labels — exactly the shape of building a slowly-changing dimension with current rank tracked.
Time-box: 20 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define fact, dimension, conformed dimension, surrogate key.
Choose between Kimball, Inmon, Data Vault, OBT for a given scenario.
Implement SCD Type 2 in SQL (valid_from / valid_to / surrogate key).
Lay out Bronze/Silver/Gold for a lakehouse with dbt.
Explain why OBT thrives on columnar storage despite the duplication.
Solve relative-ranks — sort + label assignment, a mini surrogate-key generator.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Recommender Systems — Two-Tower, Multi-Stage Ranking

Idiomatic Python (and a Touch of C++) — Type Hints, Protocols, Dataclasses