Blog
dbt Best Practices for Senior Data Engineers: Beyond the Tutorial
Ryan Kirsch · November 30, 2025 · 10 min read
The dbt getting-started guide teaches you to write a model, run it, and test it. That covers about 20% of what a senior engineer needs to know. The other 80% -- project architecture, incremental strategy, cross-team contracts, and the meta-skills that keep a project maintainable as it scales -- is learned the hard way. Here is the short path.
Project Structure: The Decision That Compounds
The structure you choose in week one compounds into either clarity or chaos by month six. The standard dbt project layout works well if you apply it deliberately:
```text
models/
  staging/            # One-to-one with source tables. Minimal transforms.
    salesforce/
      stg_sf_accounts.sql
      stg_sf_opportunities.sql
    stripe/
      stg_stripe_charges.sql
  intermediate/       # Business logic that isn't ready for gold.
    int_revenue_attribution.sql
    int_customer_lifecycle.sql
  marts/              # Business-facing. Organized by domain.
    finance/
      fct_monthly_revenue.sql
      dim_customers.sql
    product/
      fct_user_activity.sql
      fct_feature_adoption.sql
    core/             # Shared dimensions used across marts.
      dim_date.sql
      dim_geography.sql
```

The staging layer is the most frequently violated. Staging models should do exactly four things: rename columns to consistent conventions, cast types, add a surrogate key if needed, and deduplicate. If your staging model has a JOIN or a business filter, you have promoted business logic into the source layer, and now it is load-bearing in ways you will regret.
The intermediate layer is where senior engineers earn their keep. Most teams skip it entirely, pushing everything from staging directly into marts. The result is mart models with 8 CTEs and 150 lines of SQL that no one wants to touch. Intermediate models are the place for shared business logic -- revenue attribution rules, customer lifecycle calculations, funnel definitions -- that multiple marts consume. They also make the mart models readable.
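To make that concrete, here is a sketch of what an intermediate model such as `int_revenue_attribution` might hold. The `stg_orders` and `stg_touchpoints` refs, the column names, and the attribution rule itself are illustrative assumptions:

```sql
-- models/intermediate/int_revenue_attribution.sql (hypothetical sketch)
-- The attribution rule lives here once, not copy-pasted into each mart.
SELECT
    o.order_id,
    o.customer_id,
    o.order_total_usd,
    CASE
        WHEN t.first_touch_channel = 'paid_search' THEN 'paid'
        WHEN t.first_touch_channel IS NULL         THEN 'unattributed'
        ELSE 'organic'
    END AS revenue_channel
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_touchpoints') }} t
    ON o.customer_id = t.customer_id
```

When finance and product marts both consume this model, a change to the attribution rule happens in one place and every downstream number moves together.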
Incremental Models: Getting the Strategy Right
Incremental models are the source of most dbt production incidents. The configuration looks simple; the edge cases are not.
The three strategies and when to use each:
```sql
-- Strategy 1: append
-- Use when: immutable event data (clicks, logs, impressions)
-- Risk: will duplicate rows if run twice over the same window
-- No unique_key: append never deletes or updates existing rows.
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}
```
```sql
-- Strategy 2: delete+insert
-- (the behavior some adapters default to when a unique_key is set)
-- Use when: events that can update (order status changes, etc.)
-- Risk: full partition scan on large tables
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='order_id',
    partition_by={'field': 'order_date', 'data_type': 'date'}
) }}
```
```sql
-- Strategy 3: merge
-- Use when: slowly changing dimensions, upsert patterns
-- Risk: requires warehouse MERGE support; most expensive per row
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    merge_update_columns=['status', 'updated_at', 'email']
) }}
```

The unique_key setting is the most commonly misconfigured. With the delete+insert strategy, setting a unique_key makes dbt issue a DELETE for matching records before inserting. On a table with 500M rows and no partition filter, that DELETE is a full-table scan. Always pair a unique_key with a partition filter or an incremental predicate that limits the DELETE scope:
```sql
-- Good: limit the incremental window to recent partitions
{% if is_incremental() %}
WHERE order_date >= (
    SELECT DATE_SUB(MAX(order_date), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}

-- Why 3 days instead of 1: late-arriving data buffer.
-- If your source delivers late events up to 48 hours after
-- the event time, a 1-day lookback will miss them.
-- Size the window to your actual late-arrival SLA.
```

On on_schema_change: set it to sync_all_columns in development and fail in production. In production, you want to know when upstream source schemas change, not silently adapt to them.
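One way to wire that environment split directly into the model config is a Jinja conditional on the target. This is a sketch; the target names dev and prod are assumptions about your profiles.yml:

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    -- Adapt silently in dev, fail loudly in prod
    on_schema_change='sync_all_columns' if target.name == 'dev' else 'fail'
) }}
```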
Testing Philosophy: What to Test and Why
Most dbt projects under-test or test the wrong things. The goal of dbt tests is not to achieve coverage. It is to catch the specific failures that would reach stakeholders undetected.
Tests by layer and purpose:
```yaml
# Staging layer: test the source contract
# Goal: catch upstream schema changes early
models:
  - name: stg_stripe_charges
    columns:
      - name: charge_id
        tests: [unique, not_null]
      - name: amount_cents
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: status
        tests:
          - accepted_values:
              values: ['succeeded', 'pending', 'failed', 'refunded']
```

```yaml
# Mart layer: test business logic
# Goal: catch transformation bugs and business rule violations
models:
  - name: fct_monthly_revenue
    tests:
      # Model-level guard: revenue should never go negative
      # (catches accidental data loss in incremental runs)
      - dbt_utils.expression_is_true:
          expression: "revenue_usd >= 0"
    columns:
      - name: month
        tests: [unique, not_null]
      - name: revenue_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"
```

Avoid the trap of testing everything with not_null and unique. Those are useful, but they tell you nothing about correctness. A row count comparison between source and staging -- a reconciliation test -- catches far more real bugs:
```sql
-- tests/reconcile_staging_orders.sql
-- Fail if staging drops >0.1% of source rows.
-- Note: the drop_rate alias is computed in a subquery because
-- standard SQL cannot reference a SELECT alias in WHERE.
SELECT *
FROM (
    SELECT
        source_count,
        staging_count,
        ABS(source_count - staging_count)
            / NULLIF(source_count, 0) AS drop_rate
    FROM (
        SELECT COUNT(*) AS source_count FROM {{ source('raw', 'orders') }}
    ) s
    CROSS JOIN (
        SELECT COUNT(*) AS staging_count FROM {{ ref('stg_orders') }}
    ) t
) counts
WHERE drop_rate > 0.001
```

Data Contracts and Cross-Team Governance
In a single-team dbt project, a model is a model. In a multi-team organization, a model is an API. The difference matters when another team builds a dashboard or a downstream pipeline on your mart models -- any breaking change you make silently breaks their work.
dbt 1.5+ introduced contracts as a first-class feature:
```yaml
# models/marts/finance/fct_monthly_revenue.yml
models:
  - name: fct_monthly_revenue
    config:
      contract:
        enforced: true  # dbt errors if actual schema != declared schema
    columns:
      - name: month
        data_type: date
        constraints:
          - type: not_null
          - type: unique
      - name: revenue_usd
        data_type: numeric
        constraints:
          - type: not_null
      - name: order_count
        data_type: integer
        constraints:
          - type: not_null
```

With contract enforcement on, dbt verifies at build time, before the model materializes, that the SQL produces exactly the declared column names and types. If you rename a column or change a type, the run fails before anything reaches the warehouse. This is the dbt equivalent of typed function signatures -- your mart model cannot silently break its consumers.
For public models that other teams depend on, also set the access level:
```yaml
# dbt_project.yml
models:
  myproject:
    staging:
      +access: private    # staging models: internal only
    intermediate:
      +access: private    # intermediate: internal only
    marts:
      +access: public     # marts: cross-team consumption OK
      finance:
        +access: public
```

Performance Patterns Worth Knowing
dbt is a SQL orchestrator, not a query optimizer. The performance work still happens in your warehouse SQL. But there are dbt-specific patterns that significantly affect runtime:
Avoid ref() chains that force sequential runs. If model C depends on B which depends on A, they run in series. Consider whether B and C can both depend directly on A if the intermediate logic is simple. Shorter chains parallelize better.
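A sketch of what flattening a chain looks like in practice. The model names here are hypothetical; the point is that both downstream models ref the staging model directly instead of chaining through an intermediate that adds nothing:

```sql
-- Before: stg_orders -> int_orders_enriched -> fct_daily_orders (serial)
-- After: both facts ref stg_orders directly and can build in parallel.

-- models/marts/fct_daily_orders.sql
SELECT order_date, COUNT(*) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY order_date

-- models/marts/fct_daily_revenue.sql
SELECT order_date, SUM(order_total_usd) AS revenue_usd
FROM {{ ref('stg_orders') }}
GROUP BY order_date
```

Keep the intermediate model only when it carries real shared logic; a pass-through layer buys you nothing but a longer critical path.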
Use ephemeral models sparingly. Ephemeral models are inlined as CTEs into the models that reference them. This is convenient but creates deeply nested SQL that is hard to debug. Prefer views for shared staging logic that does not need to be materialized.
Pre-hook and post-hook for warehouse optimization:
```sql
-- Snowflake: cluster the table after building
{{ config(
    materialized='table',
    post_hook=[
        "ALTER TABLE {{ this }} CLUSTER BY (order_date, customer_id)"
    ]
) }}
```

```sql
-- BigQuery: partition and cluster at build time
{{ config(
    materialized='table',
    partition_by={'field': 'created_date', 'data_type': 'date'},
    cluster_by=['customer_id', 'product_id']
) }}
```

Use dbt ls before large runs. dbt ls --select +model_name shows you every upstream dependency of a model. Running this before a dbt run --select +model_name in production tells you exactly what will execute. No surprises.
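A typical pre-flight check might look like this. The --select and --exclude flags are standard dbt node-selection syntax; the model names are illustrative:

```shell
# List everything the selection would build, without running anything
dbt ls --select +fct_monthly_revenue

# Then execute exactly that selection
dbt run --select +fct_monthly_revenue

# Exclude an expensive upstream subtree you know is already fresh
dbt run --select +fct_monthly_revenue --exclude stg_stripe_charges+
```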
The Meta-Skills: What Separates Senior from Mid
Technical dbt knowledge is table stakes at the senior level. The differentiation comes from three meta-skills:
Naming discipline. Model names communicate purpose. The fct_ prefix means a fact table (one row per event). dim_ means a dimension (one row per entity). int_ means intermediate business logic. stg_ means a source-aligned staging model. Enforce these conventions in code review, not just in documentation. A model named customer_data in a production dbt project is a maintenance problem waiting to happen.
Deprecation strategy. Old models accumulate. A mart model that used to power a dashboard no one uses is still running, still costing compute, still confusing new team members who think it matters. Mark models with meta: flags in their YAML when they are deprecated candidates. Build a quarterly review into your team process. Delete things.
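One possible shape for those deprecation flags. The meta keys below are team conventions, not dbt-reserved names; dbt ignores meta content, which is exactly what makes it safe for process metadata like this:

```yaml
models:
  - name: fct_legacy_channel_report
    meta:
      deprecated: true
      deprecation_review: "2026-Q1"
      replacement: fct_monthly_revenue
```

A quarterly grep for `deprecated: true` then becomes the agenda for the cleanup review.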
Documentation as a forcing function. Writing a model description forces you to articulate what the model actually does. If you cannot write two clear sentences about what a model produces and who uses it, the model is probably doing too much. Use the documentation requirement as a design review, not just a compliance checkbox.
```yaml
# Good documentation forces clarity
models:
  - name: fct_monthly_revenue
    description: >
      Monthly revenue aggregated from delivered orders.
      One row per calendar month. Excludes refunded and
      cancelled orders. Source of truth for the Finance
      team's MRR dashboard and board reporting.
      Owner: data-platform@company.com
    meta:
      owner: data-platform
      consumers: [finance-dashboard, board-reporting-pipeline]
      sla: "updated by 6am on the 1st of each month"
      deprecated: false
```

The Part No Tutorial Covers
The hardest part of senior dbt work is not the SQL or the configuration. It is the organizational work: establishing conventions before the project grows past the point where everyone agrees, getting engineering and analytics to share ownership of the same models, and convincing stakeholders that a slow build with proper testing is faster than a fast build that breaks every time upstream changes.
dbt is a coordination tool as much as a transformation tool. The engineers who get the most out of it are the ones who treat model design as an API design problem -- thinking about downstream consumers, change management, and long-term maintainability before writing the first SELECT.
The SQL is the easy part. Getting the team to agree on what “monthly revenue” means and encoding that definition in a contract that does not break silently -- that is the senior engineer's actual job.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.