Data Engineering · January 9, 2026 · 10 min read

Data Modeling for Data Engineers: Dimensional, OBT, and When to Use Each

The modeling debate is not about which pattern is correct. It is about which pattern fits your query patterns, team, and tooling. Here is how to actually choose.

Data modeling is one of those skills that data engineers often absorb informally, through osmosis and legacy schemas, rather than studying deliberately. You inherit a star schema or a flat wide table, learn its conventions, and ship on top of it. The problem surfaces when you need to build from scratch or make a deliberate architectural decision, and you realize you have opinions but not frameworks.

This post covers the main modeling patterns you will encounter as a senior data engineer: dimensional modeling, the one big table approach, data vault, and entity-centric models. More importantly, it covers how to choose between them when the choice is yours to make.

Dimensional Modeling: The Classic

Dimensional modeling, introduced by Ralph Kimball in the 1990s, remains the dominant pattern for analytical data warehouses. The core structure is the star schema: a central fact table surrounded by dimension tables.

Fact tables contain the measurements you care about, things like revenue, page views, or order quantities, along with foreign keys to dimensions. Dimension tables contain the context: who, what, where, when. A simple example:

-- Fact table
CREATE TABLE fct_orders (
  order_id STRING,
  customer_id STRING,  -- FK to dim_customers
  product_id STRING,   -- FK to dim_products
  date_id STRING,      -- FK to dim_date
  revenue NUMERIC,
  quantity INTEGER,
  discount_amount NUMERIC
);

-- Dimension table
CREATE TABLE dim_customers (
  customer_id STRING,
  customer_name STRING,
  email STRING,
  segment STRING,
  country STRING,
  city STRING,
  acquisition_channel STRING
);

The star schema optimizes for analytical queries. Joins are simple (fact to dimension, never dimension to dimension in a star schema), query plans are predictable, and business users can navigate the model intuitively because it mirrors how they think about their business.
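A typical analytical query against this model joins the fact to whichever dimensions supply the grouping context. A minimal sketch against the tables above:

```sql
-- Revenue by customer segment and product category
SELECT
  c.segment,
  p.category,
  SUM(f.revenue) AS total_revenue
FROM fct_orders f
JOIN dim_customers c USING (customer_id)
JOIN dim_products p USING (product_id)
GROUP BY c.segment, p.category
ORDER BY total_revenue DESC;
```

Every join is fact-to-dimension on a single key, which is what keeps query plans predictable.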

The snowflake schema normalizes dimensions further, splitting them into sub-dimensions. A dim_customers table might reference a dim_segments table rather than storing segment names inline. This reduces storage and avoids update anomalies, but it adds join complexity and is generally the wrong tradeoff in a cloud data warehouse where storage is cheap and join performance is high.
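To make the tradeoff concrete, here is a sketch of what snowflaking the segment attribute looks like (table and column names are illustrative): the segment name moves to its own table and every query that wants it pays an extra hop.

```sql
-- Snowflaked: segment split out of dim_customers
CREATE TABLE dim_segments (
  segment_id STRING,
  segment_name STRING
);

-- Queries now need an extra join to reach the segment name
SELECT s.segment_name, SUM(f.revenue) AS revenue
FROM fct_orders f
JOIN dim_customers c USING (customer_id)
JOIN dim_segments s ON c.segment_id = s.segment_id
GROUP BY s.segment_name;
```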

Dimensional modeling works well when: your queries are primarily analytical aggregations, your team includes analysts who will write SQL directly against the model, your fact grain is well-defined (one row per order, one row per event), and your dimensions are relatively stable.

Slowly Changing Dimensions

One of the most important concepts in dimensional modeling is the slowly changing dimension (SCD), which describes how to handle records in dimension tables that change over time.

SCD Type 1 overwrites the old value. If a customer moves from New York to Chicago, you update the row. No history. This is appropriate when historical accuracy is not needed or when the change was an error.
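In SQL, a Type 1 change is nothing more than an in-place update (the customer ID here is illustrative):

```sql
-- Type 1: overwrite in place; the New York value is gone
UPDATE dim_customers
SET city = 'Chicago'
WHERE customer_id = 'C123';
```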

SCD Type 2 tracks history by creating a new row for each change, with effective date ranges and an active flag. The customer gets two rows: one for the New York period and one for the Chicago period.

CREATE TABLE dim_customers (
  customer_sk INTEGER,        -- surrogate key
  customer_id STRING,         -- natural key
  customer_name STRING,
  city STRING,
  effective_from DATE,
  effective_to DATE,          -- NULL if current
  is_current BOOLEAN
);
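The payoff of Type 2 is the point-in-time join: facts can be attributed to the dimension values that were current when the fact occurred. A sketch, assuming the fact table carries an order_date:

```sql
-- Revenue by the city the customer lived in at order time
SELECT d.city, SUM(f.revenue) AS revenue
FROM fct_orders f
JOIN dim_customers d
  ON f.customer_id = d.customer_id
 AND f.order_date >= d.effective_from
 AND (f.order_date < d.effective_to OR d.effective_to IS NULL)
GROUP BY d.city;
```

Treating the date range as half-open (inclusive from, exclusive to) avoids double-counting a fact that lands exactly on a version boundary.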

SCD Type 2 is the most common approach for dimensions that matter historically. It is also the most complex to maintain. In dbt, the snapshots feature handles SCD Type 2 automatically using the check or timestamp strategy.
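A dbt snapshot for this dimension might look like the following sketch (the source name and updated_at column are assumptions about your project):

```sql
{% snapshot customers_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

SELECT * FROM {{ source('app', 'customers') }}

{% endsnapshot %}
```

dbt manages the versioning columns for you (dbt_valid_from and dbt_valid_to), so you do not hand-write the effective-date logic.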

SCD Type 3 adds a new column for the previous value. This trades flexibility for simplicity: you get one version of history, not full history. Use it when you only need to track a single previous state and do not need arbitrary point-in-time lookups.
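Continuing the same illustrative example, Type 3 is a dedicated column plus a shift-then-overwrite update:

```sql
-- Type 3: keep exactly one previous value in its own column
ALTER TABLE dim_customers ADD COLUMN previous_city STRING;

UPDATE dim_customers
SET previous_city = city,
    city = 'Chicago'
WHERE customer_id = 'C123';
```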

One Big Table: When Denormalization Wins

The one big table (OBT) pattern is exactly what it sounds like: a wide, heavily denormalized table that pre-joins everything into a single flat structure. Instead of a fact table referencing dimension tables, you materialize all the relevant context directly into the fact rows.

-- OBT: materialize everything into one row
CREATE TABLE obt_orders_enriched AS
SELECT
  o.order_id,
  o.revenue,
  o.order_date,
  c.customer_name,
  c.segment,
  c.country,
  p.product_name,
  p.category,
  p.brand
FROM fct_orders o
JOIN dim_customers c USING (customer_id)
JOIN dim_products p USING (product_id);

OBT works well for machine learning feature tables, where models need wide flat input rather than normalized references. It is also appropriate for BI tools that cannot easily perform multi-table joins, or for teams where the primary consumers are analysts who do not write SQL and rely on drag-and-drop dashboards.

The downsides of OBT are real. Storage increases because dimension values are repeated across every row. Updates become expensive because changing a dimension value requires updating every row that references it, not just the dimension table. Query flexibility decreases because the table is pre-joined to specific dimensions, and ad-hoc questions that need different context require a different table or a rewrite.

In practice, OBT and dimensional models coexist. The gold layer in a medallion architecture often contains both: star schema tables for exploratory analysis and OBT tables optimized for specific BI dashboards or ML pipelines.

Data Vault: For Auditability and Flexibility

Data vault is a modeling methodology designed for enterprise data warehouses where auditability, source tracking, and schema flexibility matter more than query performance. The three building blocks are hubs, links, and satellites.

Hubs contain unique business keys with no other attributes. A hub_customer contains only customer IDs plus load metadata. Links represent relationships between hubs. A link_order_customer connects orders to customers. Satellites contain the descriptive attributes for hubs or links, with full history and load timestamps.
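A minimal sketch of the three structures, using column naming conventions common in Data Vault 2.0 (hash keys, load timestamps, record source); the exact names are illustrative:

```sql
-- Hub: business keys only, no descriptive attributes
CREATE TABLE hub_customer (
  customer_hk STRING,       -- hash of the business key
  customer_id STRING,       -- natural business key
  load_ts TIMESTAMP,
  record_source STRING
);

-- Link: a relationship between hubs
CREATE TABLE link_order_customer (
  order_customer_hk STRING,
  order_hk STRING,
  customer_hk STRING,
  load_ts TIMESTAMP,
  record_source STRING
);

-- Satellite: descriptive attributes, full history
CREATE TABLE sat_customer_details (
  customer_hk STRING,
  customer_name STRING,
  city STRING,
  load_ts TIMESTAMP,
  record_source STRING
);
```

A new attribute or a new source system becomes a new satellite rather than a change to existing tables, which is where the schema flexibility comes from.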

Data vault is rarely the right choice for a startup or a team building their first data warehouse. It adds significant structural complexity in exchange for benefits that matter in regulated industries (finance, healthcare, government) where you need to prove exactly which source system provided which data at which point in time.

If you are inheriting or working in a data vault environment, the key skill is understanding the business vault layer, which typically contains pre-joined business-friendly views built on top of the raw vault. Most analysts and data consumers never touch the raw vault directly.

Entity-Centric Models

Entity-centric modeling organizes data around core business entities (customers, products, users, accounts) rather than around events or facts. Each entity gets a single table that aggregates everything known about it.

-- Entity-centric customer table
-- (source model names and functions are illustrative; assumes
-- fct_orders carries an order_date column)
CREATE TABLE entity_customers AS
SELECT
  c.customer_id,
  -- Identity
  c.email,
  c.customer_name AS name,
  -- Acquisition
  MIN(o.order_date) AS first_order_date,
  c.acquisition_channel,
  -- Behavior (aggregated)
  COUNT(o.order_id) AS total_orders,
  SUM(o.revenue) AS lifetime_revenue,
  AVG(o.revenue) AS avg_order_value,
  DATE_DIFF(CURRENT_DATE, MAX(o.order_date), DAY) AS days_since_last_order,
  -- Segments (from upstream scoring models)
  s.rfm_segment,
  s.churn_risk_score,
  -- Meta
  CURRENT_TIMESTAMP AS last_updated_at
FROM dim_customers c
LEFT JOIN fct_orders o USING (customer_id)
LEFT JOIN int_customer_scores s USING (customer_id)
GROUP BY
  c.customer_id, c.email, c.customer_name, c.acquisition_channel,
  s.rfm_segment, s.churn_risk_score;

Entity-centric models work exceptionally well for operational analytics use cases where the question is always "what do we know about this entity right now." They are common in marketing analytics, customer success tooling, and product analytics. They are less useful for time-series analysis or when you need to examine the events that led to the current state.

The dbt Layer Architecture

In modern data stacks, modeling patterns are not mutually exclusive. They map to layers in the transformation pipeline:

Staging layer: source-aligned models, one per source table, minimal transformation. Rename columns, cast types, standardize nulls. No joins across sources.

Intermediate layer: business logic, joins within a domain. This is where SCD logic, sessionization, or complex event processing happens.

Marts layer: consumer-ready. This is where you choose your modeling pattern based on the consumer. Star schema for general analytics, OBT for specific dashboards, entity tables for operational use cases.

models/
  staging/
    stg_orders.sql
    stg_customers.sql
    stg_products.sql
  intermediate/
    int_orders_with_refunds.sql
    int_customer_sessions.sql
  marts/
    core/
      fct_orders.sql         # fact table
      dim_customers.sql      # SCD type 2
      dim_products.sql
    analytics/
      obt_orders_enriched.sql  # OBT for BI tool
    product/
      entity_customers.sql   # entity-centric
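A staging model in this layout stays deliberately boring: rename, cast, standardize, nothing else. A sketch of stg_orders.sql, where the raw column names and source reference are assumptions about the upstream system:

```sql
-- models/staging/stg_orders.sql
SELECT
  id AS order_id,
  customer AS customer_id,
  CAST(total AS NUMERIC) AS revenue,
  CAST(created_at AS DATE) AS order_date
FROM {{ source('app', 'orders') }}
```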

Grain: The Most Important Decision You Make

Before choosing a modeling pattern, define your grain. The grain is the precise definition of what one row in your fact or entity table represents.

Getting grain wrong is one of the most expensive modeling mistakes you can make. If you mix grains in a single table (some rows represent orders, some represent order line items), your aggregations will be wrong in ways that are hard to detect until a stakeholder finds a discrepancy that embarrasses everyone.

Always write down the grain in your model description before writing any SQL. For a fact table: "One row per order line item, defined as a unique combination of order_id and product_id." For an entity table: "One row per customer, representing the current state as of the last refresh."
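Once the grain is written down, it is cheap to enforce. A grain check is just a duplicate query on the declared key (the table name here is illustrative):

```sql
-- Should return zero rows if the declared grain holds
SELECT order_id, product_id, COUNT(*) AS row_count
FROM fct_order_line_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
```

In dbt, the equivalent is a unique test on a surrogate key built from the grain columns, run on every build.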

How to Choose

The practical decision matrix:

Use dimensional modeling (star schema) when your primary consumers are analysts writing SQL, when you need point-in-time historical accuracy via SCD, and when the fact table grain is clean and well-defined.

Use OBT when your consumers are BI tools with limited join capabilities, when you are building ML feature tables, or when a specific high-traffic dashboard would benefit from a pre-joined structure.

Use data vault when you are in a regulated environment with strict audit requirements, when you have multiple source systems with conflicting natural keys, or when schema evolution is frequent and you cannot afford to break downstream consumers.

Use entity-centric when your primary question is about current entity state, when the consumers are operational tools rather than analytical dashboards, or when you are building a customer 360 or product analytics foundation.

In practice, most mature data platforms use all four. The staging and intermediate layers do the heavy lifting, and the marts layer serves different modeling patterns to different consumers. The skill is in knowing which pattern serves which consumer, and resisting the urge to pick one pattern and apply it universally.

The Conversation You Need to Have First

Before you design a schema, answer these questions with your stakeholders: How will this data be queried? By whom? With what tool? How often does the data change? Do we need historical accuracy or just current state? What are the most common aggregate patterns? What does a "wrong answer" look like and how would we catch it?

Modeling decisions made without answers to these questions tend to get revisited in six months when the business evolves and the schema no longer fits. The conversations are annoying to have upfront and far more expensive to skip.
