Data Warehouse Architecture Patterns: Kimball, Inmon, and the Modern Lakehouse
The architecture debate has been running since the 1990s. Here is what each approach actually offers, where each falls short, and how modern data stacks have synthesized the best of all of them.
Most data engineers encounter one architecture in practice and assume it is the default. The Kimball vs. Inmon debate feels historical and irrelevant until you join a company using the other approach and realize the assumptions baked into your mental model do not apply. Understanding all the major patterns, including the modern lakehouse synthesis, makes you a better architect and a more effective technical communicator across teams with different histories.
Kimball: Bottom-Up, Business-First
Ralph Kimball's dimensional modeling approach builds the warehouse from the bottom up, starting with individual business processes and creating data marts for each one. A data mart serves a specific business domain: sales, finance, marketing. The enterprise warehouse emerges from the integration of these marts rather than being designed centrally first.
The core artifacts are fact tables (measurements: revenue, clicks, orders) and dimension tables (context: customers, products, dates). These form star schemas, which are optimized for analytical queries because they minimize join complexity and map to how business users think about their data.
-- Kimball star schema: fact table + dimensions
fct_orders
├── order_id (PK)
├── customer_sk (FK → dim_customers)
├── product_sk (FK → dim_products)
├── date_sk (FK → dim_date)
├── revenue
└── quantity

dim_customers
├── customer_sk (surrogate PK)
├── customer_id (natural key)
├── name, email, segment, country
├── effective_from, effective_to (SCD Type 2)
└── is_current
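The SCD Type 2 columns in dim_customers track history by closing out the old row and inserting a new version when a tracked attribute changes. A minimal sketch of that upsert logic, using Python's stdlib sqlite3 with hypothetical column names matching the schema above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dim_customers (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id TEXT,                               -- natural key
        segment TEXT,
        effective_from TEXT,
        effective_to TEXT,                              -- NULL while current
        is_current INTEGER
    )
""")

def scd2_upsert(con, customer_id, segment, as_of):
    """Close the current row (if the tracked attribute changed) and insert a new one."""
    cur = con.execute(
        "SELECT customer_sk, segment FROM dim_customers "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if cur and cur[1] == segment:
        return  # no change: keep the current row
    if cur:
        con.execute(
            "UPDATE dim_customers SET effective_to = ?, is_current = 0 "
            "WHERE customer_sk = ?",
            (as_of, cur[0]),
        )
    con.execute(
        "INSERT INTO dim_customers "
        "(customer_id, segment, effective_from, effective_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, segment, as_of),
    )

scd2_upsert(con, "C1", "smb", "2024-01-01")
scd2_upsert(con, "C1", "enterprise", "2024-06-01")  # segment changed: new version
rows = con.execute(
    "SELECT segment, is_current FROM dim_customers ORDER BY customer_sk"
).fetchall()
print(rows)  # [('smb', 0), ('enterprise', 1)]
```

The surrogate key matters here: fact rows reference the specific customer version that was current at load time, so historical facts keep their historical context.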
Kimball's strengths: business users can query the mart directly with minimal SQL knowledge. Delivery is fast because you build one domain at a time. The dimensional model maps naturally to BI tool concepts (dimensions as filters, facts as measures).
Kimball's weaknesses: conformed dimensions (shared dimensions used consistently across marts) are hard to maintain as the number of marts grows. Inconsistent grain definitions across marts create data discrepancies that are difficult to diagnose. The bottom-up approach can produce marts that are hard to integrate later when cross-domain questions arise.
Inmon: Top-Down, Enterprise-First
Bill Inmon's enterprise data warehouse approach is top-down: design a normalized, subject-oriented, integrated warehouse first, then build data marts as views or aggregates on top of it. The central warehouse is the single source of truth. Marts are derived from it, not independent.
The central warehouse stores data in third normal form (3NF): normalized to eliminate redundancy, subject-oriented (customer, product, transaction rather than source-system-oriented), and integrated across all source systems. This is the canonical "one version of the truth" architecture.
-- Inmon 3NF: normalized, subject-oriented
customers (customer_id, name, email)
customer_segments (customer_id, segment_id, effective_date)
segments (segment_id, segment_name, description)
orders (order_id, customer_id, order_date)
order_lines (order_id, product_id, quantity, unit_price)
products (product_id, name, category_id)
categories (category_id, name, parent_id)

-- Data mart built on top for analysts
CREATE VIEW sales_mart AS
SELECT c.name, s.segment_name,
       SUM(ol.quantity * ol.unit_price) AS revenue
FROM orders o
JOIN order_lines ol USING (order_id)
JOIN customers c USING (customer_id)
JOIN customer_segments cs USING (customer_id)
JOIN segments s USING (segment_id)
GROUP BY c.name, s.segment_name;
Inmon's strengths: the normalized central warehouse is flexible and can answer questions that were not anticipated when the schema was designed. Data quality and integration are enforced centrally. There is a genuine single source of truth.
Inmon's weaknesses: the upfront design effort is substantial, and time-to-first-insight is much longer than with Kimball. Normalized schemas require more complex queries from business users. The iterative delivery that modern data teams expect is harder to achieve when everything flows from a central schema.
Data Vault: Flexibility and Auditability
Data Vault is a methodology designed for enterprise data warehouses where auditability, schema flexibility, and source tracking matter above all else. It decomposes every data entity into three types: hubs (business keys only), links (relationships between hubs), and satellites (descriptive attributes with full history).
-- Data Vault: hubs, links, satellites
hub_customer (customer_hk, customer_id, load_date, record_source)
hub_order (order_hk, order_id, load_date, record_source)
link_customer_order (customer_order_hk, customer_hk, order_hk, load_date)
sat_customer_details (customer_hk, load_date, name, email, segment, hash_diff)
sat_order_details (order_hk, load_date, status, amount, hash_diff)
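The hash key (`_hk`) columns are typically a hash of the normalized business key(s), and `hash_diff` is a hash of a satellite's full descriptive payload, used to detect change without column-by-column comparison. A sketch of both, using stdlib hashlib (the normalization rules and column names here are illustrative assumptions, not a standard):

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Hub/link hash key: hash of the normalized business key(s)."""
    payload = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(payload.encode()).hexdigest()

def hash_diff(attributes: dict) -> str:
    """Satellite hash_diff: hash of all descriptive attributes in a fixed
    column order, so any changed payload produces a different digest."""
    payload = "||".join(str(attributes[k]) for k in sorted(attributes))
    return hashlib.md5(payload.encode()).hexdigest()

customer_hk = hash_key("C1")
link_hk = hash_key("C1", "O42")  # a link key combines both hubs' business keys
old = hash_diff({"name": "Ada", "email": "ada@example.com", "segment": "smb"})
new = hash_diff({"name": "Ada", "email": "ada@example.com", "segment": "enterprise"})
print(old != new)  # True: the loader would insert a new satellite row
```

Because keys are computed deterministically from source data rather than looked up, hubs, links, and satellites can be loaded in parallel without sequencing dependencies.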
Data Vault's strengths: extremely flexible to schema changes in source systems. Complete auditability (every record has load date and source). Supports parallel loading because hubs, links, and satellites load independently.
Data Vault's weaknesses: high structural complexity. Most consumers need a "business vault" or information mart layer on top because the raw vault is not queryable by analysts. Tooling and expertise are less common than for Kimball or Inmon. Generally overkill for anything except regulated enterprise environments.
The Modern Lakehouse: The Synthesis
The lakehouse architecture separates storage from compute and uses open table formats (Apache Iceberg, Delta Lake, Apache Hudi) on object storage (S3, GCS) as the storage layer. Compute engines (Spark, Trino, DuckDB, Snowflake, BigQuery) read from and write to the same storage layer.
-- Modern lakehouse layer structure
s3://data-lake/
  bronze/               # Raw, source-aligned (Inmon influence)
    orders/
    customers/
  silver/               # Cleaned, validated, integrated
    fct_orders/
    dim_customers/
  gold/                 # Business-ready, query-optimized
    sales_mart/         # Kimball-style star schema
    entity_customers/   # Entity-centric
-- Same data, multiple compute engines
Spark → bronze to silver ETL jobs
dbt → silver to gold transformations
Trino → ad-hoc queries by analysts
DuckDB → local development and testing

The lakehouse combines Inmon's centralized, normalized raw layer with Kimball's business-optimized mart layer. The bronze/silver/gold (or raw/staging/marts) naming convention reflects this synthesis: bronze and silver are source-aligned and then integrated (Inmon-ish), gold is consumer-ready with dimensional models (Kimball-ish).
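A toy end-to-end sketch of the bronze → silver → gold flow, using local directories in place of S3 and only stdlib Python; the paths, field names, and cleaning rules are all hypothetical:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())  # stand-in for s3://data-lake/
(lake / "bronze").mkdir()

# Bronze: raw, source-aligned records land exactly as delivered.
raw = [
    {"order_id": 1, "customer": " C1 ", "revenue": "100.0"},
    {"order_id": 2, "customer": "C1", "revenue": "50.0"},
    {"order_id": 2, "customer": "C1", "revenue": "50.0"},  # duplicate delivery
]
(lake / "bronze" / "orders.json").write_text(json.dumps(raw))

# Silver: cleaned, typed, deduplicated, integrated.
seen, silver = set(), []
for rec in json.loads((lake / "bronze" / "orders.json").read_text()):
    if rec["order_id"] in seen:
        continue  # drop duplicate deliveries
    seen.add(rec["order_id"])
    silver.append({"order_id": rec["order_id"],
                   "customer_id": rec["customer"].strip(),
                   "revenue": float(rec["revenue"])})

# Gold: business-ready aggregate (here via an in-memory SQL engine).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_orders (order_id, customer_id, revenue)")
con.executemany(
    "INSERT INTO fct_orders VALUES (:order_id, :customer_id, :revenue)", silver
)
gold = con.execute(
    "SELECT customer_id, SUM(revenue) FROM fct_orders GROUP BY customer_id"
).fetchall()
print(gold)  # [('C1', 150.0)]
```

In a real lakehouse each layer would be an Iceberg/Delta/Hudi table on object storage and each arrow a Spark, dbt, or Trino job, but the division of responsibility between layers is the same.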
Lakehouse strengths: decoupled storage and compute allows cost optimization (pay for compute only when querying). Multiple engines can serve different use cases from the same data. Open formats avoid vendor lock-in. The layer structure provides both the integration discipline of Inmon and the query performance of Kimball.
Lakehouse weaknesses: more moving parts than a managed warehouse (Snowflake, BigQuery). Operational complexity of managing object storage, table format metadata, and multiple compute engines. The "best of both worlds" marketing often undersells the engineering work required to make it function well.
How to Choose
Startup or small team building the first warehouse: Kimball with dbt on a cloud warehouse (Snowflake, BigQuery, or Redshift). Delivers quickly, the tooling is excellent, and the star schema model is familiar enough that analysts can query it without deep SQL expertise.
Enterprise with complex cross-domain integration requirements: consider a hybrid where the staging layer uses Inmon-style normalized integration and the marts layer uses Kimball-style dimensional models. Data Vault if you are in a regulated industry with strict audit requirements.
Team with significant unstructured or semi-structured data, or ML workloads alongside analytics: lakehouse architecture. The ability to serve both SQL analytics and ML pipelines from the same storage layer is the key advantage.
Team that values managed services over flexibility: Snowflake or BigQuery as the warehouse, with dbt for transformations. You trade some flexibility for significantly reduced operational burden. This is the right choice for most teams that do not have dedicated infrastructure engineers.
In practice, most modern data platforms are a hybrid: a managed cloud warehouse for the transformation and serving layer, with object storage for raw data archival and ML workloads. The architecture debate is less about choosing one camp and more about understanding what each approach contributes and applying the relevant principles to your actual situation.