Data Modelling — Dimensional, Data Vault, OBT for the Lakehouse
Session 28 of the 48-session learning series.
Why this session matters
This is Session 28 of 48 in the Data Engineering track. The schema you pick will outlive your job — fact tables get queried for years. Choosing between dimensional, data vault, and one-big-table is one of the highest-leverage decisions a DE makes. Get it wrong and every downstream team pays.
Agenda
- Kimball dimensional modelling — facts, dimensions, star/snowflake
- Inmon vs Kimball vs Data Vault — when each wins
- Data Vault 2.0 — hubs, links, satellites, the audit story
- One Big Table (OBT) — when denormalisation beats joins
- Modern lakehouse hybrid — Bronze/Silver/Gold + dbt
Pre-read (skim before the session)
- Kimball — The Data Warehouse Toolkit (3rd ed.) — Ch. 1
- Data Vault 2.0 — Dan Linstedt overview
- Maxime Beauchemin — The rise of the data engineer
- dbt — Modelling guide
Deep dive
1. Why modelling matters
A bad model causes:
- Queries that scan TBs because they can't push down filters.
- Reports that disagree because the definition lives in 5 places.
- New analysts who can't find anything for weeks.
- Cost overruns because every join shuffles.
A good model is queryable (fewer tables, intuitive joins), correct (one source of truth per concept), and evolvable (you can change without rewriting downstream).
2. Kimball dimensional (star schema)
The classic. Big fact table at centre, small dimension tables around it.
[ dim_customer ]
│
[ dim_product ] ─ [ fact_orders ] ─ [ dim_date ]
│
[ dim_store ]
- Fact — measurements (revenue, qty, click). One row per event. Foreign keys to dimensions. Mostly numeric.
- Dimension — descriptive context (customer attributes, product hierarchy, date breakdown). Joined to filter/group facts.
Star (one dimension table per axis) vs snowflake (dimensions normalised). Default to star — joins are cheaper in MPP/columnar engines than the storage savings of snowflake.
3. Slowly Changing Dimensions (SCD)
Customer attributes change. How do you preserve history?
| Type | Behaviour | When |
|---|---|---|
| SCD 0 | Never change | Birth date |
| SCD 1 | Overwrite | Fixing typos |
| SCD 2 | New row + valid_from/valid_to | Need historical accuracy (most analytical needs) |
| SCD 3 | Add previous_value column | Comparing then-vs-now once |
| SCD 4 | History in separate table | Massive dimensions |
| SCD 6 | Hybrid 1+2+3 | Both current and historical |
SCD 2 is the workhorse. Surrogate key (system-generated id) + business key + validity range.
4. Inmon vs Kimball
- Inmon — top-down. Build a normalised 3NF enterprise data warehouse. Then build data marts (often dimensional) for departments.
- Kimball — bottom-up. Build dimensional data marts directly, conformed via shared dimensions ("conformed bus").
Inmon: heavier upfront, more rigorous, slower. Kimball: faster delivery, occasional schema drift across marts.
Modern reality: most teams are Kimball-ish. The lakehouse Bronze layer is kind of Inmon (3NF-ish raw). Gold is dimensional.
5. Data Vault 2.0
Built for audit-heavy, regulated environments (banks, healthcare, government). Three artefact types:
- Hub — business keys + load metadata.
hub_customer(customer_bk, load_dts, source). - Link — associative; relates hubs.
link_customer_order(customer_hk, order_hk, ...). - Satellite — descriptive attributes + history.
sat_customer_demo(customer_hk, load_dts, name, city, ...).
Pros:
- Insert-only — no updates → no destructive history loss.
- Source-traceable — every record carries source + timestamp.
- Highly parallel loads.
- Schema-evolvable — new attribute = new satellite, no impact on existing.
Cons:
- Verbose; 3–4x more tables than dimensional.
- Queries need many joins → BI layer on top usually compiles down to dimensional.
Use when: regulator audits, many sources of same entity, frequent schema evolution. Skip for: small analytics shops, fast-moving startup.
6. One Big Table (OBT)
Single denormalised wide table. All attributes joined and flattened.
orders_wide(order_id, ts, customer_id, customer_name, customer_country,
product_id, product_name, product_category, ...)
Pros:
- Zero joins — fastest possible query.
- Simplest mental model.
- Plays nice with columnar pruning (no joins to break dim filters).
Cons:
- Duplication — customer name in every row of every order.
- Update nightmare — name change requires updating every row.
- Storage cost.
Modern OBT thrives because:
- Compression on columnar (Parquet) makes duplication cheap.
- Compute is much more expensive than storage on cloud.
- Many BI tools (Looker, Mode) flatten anyway.
Increasingly popular in lakehouse Gold layer. dbt + BigQuery / Snowflake users love it.
7. Bronze / Silver / Gold (lakehouse pattern)
Modern lakehouse layout:
| Layer | Content | Modelling |
|---|---|---|
| Bronze | Raw landed data, immutable | Source schema; minimal transform |
| Silver | Cleaned, deduplicated, enriched | Normalised-ish; one row per business event |
| Gold | Business-ready, denormalised | Dimensional star OR OBT; per use-case |
Decoupling layers lets you fix bugs in Gold without re-ingesting; rebuild Gold from Silver in hours.
8. dbt as the connective tissue
dbt = SQL + Jinja + DAG + tests. Models live in .sql files; dependencies inferred from {{ ref('upstream') }}.
Conventions that scale:
models/staging/— Bronze → Silver, 1:1 with source.models/intermediate/— joins, mid-level transformations.models/marts/— Gold; per business area (finance, marketing, product).- Tests in
schema.yml: not_null, unique, accepted_values, custom. - Snapshots for SCD 2.
9. Surrogate vs natural keys
- Natural key (BK) — value with meaning (
order_id,email). Can change, can be reused. - Surrogate key (SK) — system-generated id (UUID, hash, sequence). Stable.
Always carry both in dimensions. Facts reference SK (stable join), surface BK for human users.
10. Modelling time
Two date attributes per event:
- Event time — when it actually happened (user clicked).
- Processing time — when it landed in your warehouse.
Always log both. Watermarks (S23), late-arriving facts, replay correctness all depend on it.
dim_date table with pre-computed attributes (day_of_week, fiscal_quarter, is_holiday) saves countless joins later.
11. Anti-patterns
- 100-column dimension that mixes 3 entities → split.
- Fact table with no FK to time → can't analyse trends.
- Mart per dashboard → duplicate logic everywhere.
- Materialised view as load mechanism → recomputation cost explodes.
- Storing audit history in JSON columns → ungroupable.
12. Reality check
For a new project:
- Bronze: raw landings, no transforms.
- Silver: dedup, cast, enrich; one canonical event per concept.
- Gold: dimensional star or OBT depending on consumers (BI = star; ML = OBT).
- dbt for transforms; great_expectations / dbt tests for quality.
- Catalog (DataHub, OpenMetadata) for discoverability.
You don't need Data Vault unless an auditor is asking. You don't need pure Inmon unless you have decades of legacy systems to harmonise.
Reading material
Books:
- The Data Warehouse Toolkit, 3rd ed. — Ralph Kimball & Margy Ross (the dimensional-modelling bible; 30+ years and still the answer)
- Building the Data Warehouse, 4th ed. — Bill Inmon (the third-normal-form/corporate-information-factory counterpoint)
- Building a Scalable Data Warehouse with Data Vault 2.0 — Linstedt & Olschimke (the canonical Data Vault book)
- The Data Warehouse Lifecycle Toolkit — Kimball Group (the project-management companion to the modelling book)
- Fundamentals of Data Engineering — Joe Reis & Matt Housley (modern Lakehouse perspective on dim/vault/OBT)
Papers:
- Functional Data Engineering — Maxime Beauchemin 2018 (Medium) — the post that defined the modern idempotent / pure batch model (essay, but cited everywhere).
- Common Schemas for Data Warehouses — the Inmon vs Kimball debate (DAMA) — historical context for the trade-offs.
- Beyond Star and Snowflake: A Survey of Modern OLAP Data Models — the academic landscape view.
Official docs:
- dbt — How we structure our dbt projects (best practices) — the modern standard for staging / intermediate / marts.
- dbt — Slowly Changing Dimensions (snapshots) — SCD2 the dbt way.
- Databricks — Medallion architecture — Bronze/Silver/Gold; the Lakehouse name for staging/intermediate/marts.
- Snowflake — Designing Tables — the cloud-DW take on physical design.
- BigQuery — Schema design best practices — nested/repeated columns; the case for OBT.
Blog posts:
- Locally Optimistic — modelling articles archive — the best practitioner blog for analytics engineering choices.
- Modern Data Modelling for the Modern Data Stack — Tristan Handy (dbt Labs) — the manifesto from dbt's co-founder.
- One Big Table — Coalesce article — the OBT argument and its trade-offs.
In-depth research material
- dbt-core — github.com/dbt-labs/dbt-core — ~10k ★, the SQL-first transformation tool that won the modern data stack.
- SQLMesh — github.com/TobikoData/sqlmesh — ~2k ★, the modern dbt alternative with semantic diffs + virtual envs.
- Datavault4dbt — github.com/ScalefreeCOM/datavault4dbt — the canonical Data Vault implementation on dbt.
- dbt-labs — jaffle-shop reference project — the canonical small dbt project; copy this structure.
- Awesome Data Engineering — github.com/igorbarinov/awesome-data-engineering — the curated list.
- Airbnb — The Anatomy of an Airbnb Data Model — dim modelling at Airbnb scale.
- Lyft — Amundsen (data discovery) — the discovery layer on top of dim models.
- GoCardless — The Modern Data Stack at GoCardless — a real end-to-end story.
- Joe Reis — The Practical Data Modeler (Substack) — the modern reissuing of Kimball wisdom.
- Snowflake — The Tasty Bytes Data Modelling Quickstart — hands-on dim modelling on Snowflake.
Videos
- Kimball Dimensional Modelling Crash Course — Joe Reis — Joe Reis · 1 h 18 min — the canonical modern walk-through; the place to start.
- Data Vault 2.0 in 60 minutes — Dan Linstedt — Dan Linstedt · 60 min — the inventor of Data Vault explaining it himself.
- Dimensional Modeling with dbt — GitLab Data Team — 1 h — hands-on with the modern dbt patterns.
- Medallion Architecture Explained — Databricks — 24 min — the Lakehouse spin on dim modelling.
- Modern Data Modelling — Maxime Beauchemin — Maxime Beauchemin · 50 min — the Airflow/Superset creator on functional data engineering.
LeetCode — Relative Ranks
- Link: https://leetcode.com/problems/relative-ranks/
- Difficulty: Easy
- Why this problem: Sort + assign positional labels — exactly the shape of building a slowly-changing dimension with current rank tracked.
- Time-box: 20 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Define fact, dimension, conformed dimension, surrogate key.
- Choose between Kimball, Inmon, Data Vault, OBT for a given scenario.
- Implement SCD Type 2 in SQL (valid_from / valid_to / surrogate key).
- Lay out Bronze/Silver/Gold for a lakehouse with dbt.
- Explain why OBT thrives on columnar storage despite the duplication.
- Solve
relative-ranks— sort + label assignment, a mini surrogate-key generator.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.