Blog
dbt Best Practices for Senior Data Engineers: Beyond the Tutorial
Ryan Kirsch · November 30, 2025 · 10 min read
The dbt getting-started guide teaches you to write a model, run it, and test it. That covers about 20% of what a senior engineer needs to know. The other 80% -- project architecture, incremental strategy, cross-team contracts, and the meta-skills that keep a project maintainable as it scales -- is learned the hard way. Here is the short path.
Project Structure: The Decision That Compounds
The structure you choose in week one compounds into either clarity or chaos by month six. The standard dbt project layout works well if you apply it deliberately:
```text
models/
  staging/            # One-to-one with source tables. Minimal transforms.
    salesforce/
      stg_sf_accounts.sql
      stg_sf_opportunities.sql
    stripe/
      stg_stripe_charges.sql
  intermediate/       # Business logic that isn't ready for gold.
    int_revenue_attribution.sql
    int_customer_lifecycle.sql
  marts/              # Business-facing. Organized by domain.
    finance/
      fct_monthly_revenue.sql
      dim_customers.sql
    product/
      fct_user_activity.sql
      fct_feature_adoption.sql
    core/             # Shared dimensions used across marts.
      dim_date.sql
      dim_geography.sql
```

The staging layer is the most frequently violated. Staging models should do exactly four things: rename columns to consistent conventions, cast types, add a surrogate key if needed, and deduplicate. If your staging model has a JOIN or a business filter, you have promoted business logic into the source layer, and now it is load-bearing in ways you will regret.
The intermediate layer is where senior engineers earn their keep. Most teams skip it entirely, pushing everything from staging directly into marts. The result is mart models with 8 CTEs and 150 lines of SQL that no one wants to touch. Intermediate models are the place for shared business logic -- revenue attribution rules, customer lifecycle calculations, funnel definitions -- that multiple marts consume. They also make the mart models readable.
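To make that concrete, here is a sketch of what an intermediate model such as `int_revenue_attribution` might hold. The `stg_orders` and `stg_touchpoints` refs, the column names, and the attribution rule itself are illustrative assumptions:

```sql
-- models/intermediate/int_revenue_attribution.sql (hypothetical sketch)
-- The attribution rule lives here once, not copy-pasted into each mart.
SELECT
    o.order_id,
    o.customer_id,
    o.order_total_usd,
    CASE
        WHEN t.first_touch_channel = 'paid_search' THEN 'paid'
        WHEN t.first_touch_channel IS NULL         THEN 'unattributed'
        ELSE 'organic'
    END AS revenue_channel
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_touchpoints') }} t
    ON o.customer_id = t.customer_id
```

When finance and product marts both consume this model, a change to the attribution rule happens in one place and every downstream number moves together.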
Incremental Models: Getting the Strategy Right
Incremental models are the source of most dbt production incidents. The configuration looks simple; the edge cases are not.
The three strategies and when to use each:
```sql
-- Strategy 1: append
-- Use when: immutable event data (clicks, logs, impressions)
-- Risk: will duplicate rows if run twice over the same window
-- No unique_key: append never deletes or updates existing rows.
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}
```
```sql
-- Strategy 2: delete+insert
-- (the behavior some adapters default to when a unique_key is set)
-- Use when: events that can update (order status changes, etc.)
-- Risk: full partition scan on large tables
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='order_id',
    partition_by={'field': 'order_date', 'data_type': 'date'}
) }}
```
```sql
-- Strategy 3: merge
-- Use when: slowly changing dimensions, upsert patterns
-- Risk: requires warehouse MERGE support; most expensive per row
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    merge_update_columns=['status', 'updated_at', 'email']
) }}
```

The unique_key setting is the most commonly misconfigured. With the delete+insert strategy, setting a unique_key makes dbt issue a DELETE for matching records before inserting. On a table with 500M rows and no partition filter, that DELETE is a full-table scan. Always pair a unique_key with a partition filter or an incremental predicate that limits the DELETE scope:
```sql
-- Good: limit the incremental window to recent partitions
{% if is_incremental() %}
WHERE order_date >= (
    SELECT DATE_SUB(MAX(order_date), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}

-- Why 3 days instead of 1: late-arriving data buffer.
-- If your source delivers late events up to 48 hours after
-- the event time, a 1-day lookback will miss them.
-- Size the window to your actual late-arrival SLA.
```

On on_schema_change: set it to sync_all_columns in development and fail in production. In production, you want to know when upstream source schemas change, not silently adapt to them.
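One way to wire that environment split directly into the model config is a Jinja conditional on the target. This is a sketch; the target names dev and prod are assumptions about your profiles.yml:

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    -- Adapt silently in dev, fail loudly in prod
    on_schema_change='sync_all_columns' if target.name == 'dev' else 'fail'
) }}
```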
Testing Philosophy: What to Test and Why
Most dbt projects under-test or test the wrong things. The goal of dbt tests is not to achieve coverage. It is to catch the specific failures that would reach stakeholders undetected.
Tests by layer and purpose:
```yaml
# Staging layer: test the source contract
# Goal: catch upstream schema changes early
models:
  - name: stg_stripe_charges
    columns:
      - name: charge_id
        tests: [unique, not_null]
      - name: amount_cents
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
      - name: status
        tests:
          - accepted_values:
              values: ['succeeded', 'pending', 'failed', 'refunded']
```

```yaml
# Mart layer: test business logic
# Goal: catch transformation bugs and business rule violations
models:
  - name: fct_monthly_revenue
    tests:
      # Model-level guard: revenue should never go negative
      # (catches accidental data loss in incremental runs)
      - dbt_utils.expression_is_true:
          expression: "revenue_usd >= 0"
    columns:
      - name: month
        tests: [unique, not_null]
      - name: revenue_usd
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "> 0"
```

Avoid the trap of testing everything with not_null and unique. Those are useful, but they tell you nothing about correctness. A row count comparison between source and staging -- a reconciliation test -- catches far more real bugs:
```sql
-- tests/reconcile_staging_orders.sql
-- Fail if staging drops >0.1% of source rows.
-- Note: the drop_rate alias is computed in a subquery because
-- standard SQL cannot reference a SELECT alias in WHERE.
SELECT *
FROM (
    SELECT
        source_count,
        staging_count,
        ABS(source_count - staging_count)
            / NULLIF(source_count, 0) AS drop_rate
    FROM (
        SELECT COUNT(*) AS source_count FROM {{ source('raw', 'orders') }}
    ) s
    CROSS JOIN (
        SELECT COUNT(*) AS staging_count FROM {{ ref('stg_orders') }}
    ) t
) counts
WHERE drop_rate > 0.001
```

Data Contracts and Cross-Team Governance
In a single-team dbt project, a model is a model. In a multi-team organization, a model is an API. The difference matters when another team builds a dashboard or a downstream pipeline on your mart models -- any breaking change you make silently breaks their work.
dbt 1.5+ introduced contracts as a first-class feature:
```yaml
# models/marts/finance/fct_monthly_revenue.yml
models:
  - name: fct_monthly_revenue
    config:
      contract:
        enforced: true  # dbt errors if actual schema != declared schema
    columns:
      - name: month
        data_type: date
        constraints:
          - type: not_null
          - type: unique
      - name: revenue_usd
        data_type: numeric
        constraints:
          - type: not_null
      - name: order_count
        data_type: integer
        constraints:
          - type: not_null
```

With contract enforcement on, dbt verifies at build time, before the model materializes, that the SQL produces exactly the declared column names and types. If you rename a column or change a type, the run fails before anything reaches the warehouse. This is the dbt equivalent of typed function signatures -- your mart model cannot silently break its consumers.
For public models that other teams depend on, also set the access level:
```yaml
# dbt_project.yml
models:
  myproject:
    staging:
      +access: private    # staging models: internal only
    intermediate:
      +access: private    # intermediate: internal only
    marts:
      +access: public     # marts: cross-team consumption OK
      finance:
        +access: public
```

Performance Patterns Worth Knowing
dbt is a SQL orchestrator, not a query optimizer. The performance work still happens in your warehouse SQL. But there are dbt-specific patterns that significantly affect runtime:
Avoid ref() chains that force sequential runs. If model C depends on B which depends on A, they run in series. Consider whether B and C can both depend directly on A if the intermediate logic is simple. Shorter chains parallelize better.
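A sketch of what flattening a chain looks like in practice. The model names here are hypothetical; the point is that both downstream models ref the staging model directly instead of chaining through an intermediate that adds nothing:

```sql
-- Before: stg_orders -> int_orders_enriched -> fct_daily_orders (serial)
-- After: both facts ref stg_orders directly and can build in parallel.

-- models/marts/fct_daily_orders.sql
SELECT order_date, COUNT(*) AS order_count
FROM {{ ref('stg_orders') }}
GROUP BY order_date

-- models/marts/fct_daily_revenue.sql
SELECT order_date, SUM(order_total_usd) AS revenue_usd
FROM {{ ref('stg_orders') }}
GROUP BY order_date
```

Keep the intermediate model only when it carries real shared logic; a pass-through layer buys you nothing but a longer critical path.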
Use ephemeral models sparingly. Ephemeral models are inlined as CTEs into the models that reference them. This is convenient but creates deeply nested SQL that is hard to debug. Prefer views for shared staging logic that does not need to be materialized.
Pre-hook and post-hook for warehouse optimization:
```sql
-- Snowflake: cluster the table after building
{{ config(
    materialized='table',
    post_hook=[
        "ALTER TABLE {{ this }} CLUSTER BY (order_date, customer_id)"
    ]
) }}
```

```sql
-- BigQuery: partition and cluster at build time
{{ config(
    materialized='table',
    partition_by={'field': 'created_date', 'data_type': 'date'},
    cluster_by=['customer_id', 'product_id']
) }}
```

Use dbt ls before large runs. dbt ls --select +model_name shows you every upstream dependency of a model. Running this before a dbt run --select +model_name in production tells you exactly what will execute. No surprises.
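A typical pre-flight check might look like this. The --select and --exclude flags are standard dbt node-selection syntax; the model names are illustrative:

```shell
# List everything the selection would build, without running anything
dbt ls --select +fct_monthly_revenue

# Then execute exactly that selection
dbt run --select +fct_monthly_revenue

# Exclude an expensive upstream subtree you know is already fresh
dbt run --select +fct_monthly_revenue --exclude stg_stripe_charges+
```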
The Meta-Skills: What Separates Senior from Mid
Technical dbt knowledge is table stakes at the senior level. The differentiation comes from three meta-skills:
Naming discipline. Model names communicate purpose. The fct_ prefix means a fact table (one row per event). dim_ means a dimension (one row per entity). int_ means intermediate business logic. stg_ means a source-aligned staging model. Enforce these conventions in code review, not just in documentation. A model named customer_data in a production dbt project is a maintenance problem waiting to happen.
Deprecation strategy. Old models accumulate. A mart model that used to power a dashboard no one uses is still running, still costing compute, still confusing new team members who think it matters. Mark models with meta: flags in their YAML when they are deprecated candidates. Build a quarterly review into your team process. Delete things.
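One possible shape for those deprecation flags. The meta keys below are team conventions, not dbt-reserved names; dbt ignores meta content, which is exactly what makes it safe for process metadata like this:

```yaml
models:
  - name: fct_legacy_channel_report
    meta:
      deprecated: true
      deprecation_review: "2026-Q1"
      replacement: fct_monthly_revenue
```

A quarterly grep for `deprecated: true` then becomes the agenda for the cleanup review.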
Documentation as a forcing function. Writing a model description forces you to articulate what the model actually does. If you cannot write two clear sentences about what a model produces and who uses it, the model is probably doing too much. Use the documentation requirement as a design review, not just a compliance checkbox.
```yaml
# Good documentation forces clarity
models:
  - name: fct_monthly_revenue
    description: >
      Monthly revenue aggregated from delivered orders.
      One row per calendar month. Excludes refunded and
      cancelled orders. Source of truth for the Finance
      team's MRR dashboard and board reporting.
      Owner: data-platform@company.com
    meta:
      owner: data-platform
      consumers: [finance-dashboard, board-reporting-pipeline]
      sla: "updated by 6am on the 1st of each month"
      deprecated: false
```

The Part No Tutorial Covers
The hardest part of senior dbt work is not the SQL or the configuration. It is the organizational work: establishing conventions before the project grows past the point where everyone agrees, getting engineering and analytics to share ownership of the same models, and convincing stakeholders that a slow build with proper testing is faster than a fast build that breaks every time upstream changes.
dbt is a coordination tool as much as a transformation tool. The engineers who get the most out of it are the ones who treat model design as an API design problem -- thinking about downstream consumers, change management, and long-term maintainability before writing the first SELECT.
The SQL is the easy part. Getting the team to agree on what “monthly revenue” means and encoding that definition in a contract that does not break silently -- that is the senior engineer's actual job.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.