Data Lineage in Practice: How to Know What Breaks When You Change a Model
Data lineage sounds like governance theater until the day a simple model change breaks six dashboards, two reverse ETL syncs, and an executive KPI. Then it becomes the thing you wish you had invested in sooner.
Most data teams first encounter lineage as a static graph in dbt docs or a metadata tool demo. It looks useful, but not urgent. The graph becomes urgent when a column rename silently breaks a downstream Looker explore, or when an incident starts with “numbers look wrong” and nobody knows which upstream transform changed.
Lineage is not a deliverable for auditors. It is an operational tool for impact analysis, incident response, code review, onboarding, and stakeholder trust. When implemented well, it changes how a data team ships work.
What Lineage Actually Means
At the simplest level, lineage answers two questions:
- Where did this table or column come from?
- What depends on it downstream?
There are two main levels of lineage that matter in practice.
Table-level lineage shows that model A feeds model B, which feeds dashboard C. This is enough for broad impact analysis and understanding the shape of the transformation graph.
Column-level lineage shows that customer_lifetime_value on the final mart depends on order_amount, refund_amount, and account_created_at upstream. This is what you need when individual fields change or when metric trust becomes political.
raw.stripe_charges.amount
  → stg_stripe__charges.charge_amount
  → int_orders__net_revenue.net_amount
  → fct_orders.revenue
  → mart_customer_ltv.customer_lifetime_value
  → dashboard.executive_revenue_overview.total_ltv

# Table-level lineage tells you the path.
# Column-level lineage tells you the specific field logic involved.
Why dbt Lineage Matters So Much
dbt made lineage dramatically more accessible because the dependency graph is implicit in the model code. If a model references another model via ref(), dbt can construct the DAG automatically.
-- models/marts/fct_orders.sql
with orders as (
select * from {{ ref('int_orders__clean') }}
),
customers as (
select * from {{ ref('dim_customers') }}
)
select
o.order_id,
o.customer_id,
c.segment,
o.net_revenue,
o.order_date
from orders o
left join customers c
on o.customer_id = c.customer_idThat means every code change in dbt has built-in structural lineage. The catch is that teams often stop there. dbt lineage is powerful, but it only covers the dbt layer. The moment data flows into BI tools, reverse ETL tools, ML feature stores, or external APIs, you need more than the dbt graph to understand the full blast radius of a change.
Where Lineage Comes From
Useful lineage usually comes from combining several sources:
- Transformation code: dbt refs, Spark job configs, SQL files in version control
- Warehouse metadata: query history, table dependencies, view definitions
- BI metadata: dashboard queries, semantic model fields, metric definitions
- Ingestion metadata: connector mappings, CDC streams, load jobs
Tools like DataHub, OpenMetadata, Atlan, and Monte Carlo aggregate these sources to build a more complete lineage graph. Warehouse-native features can help too. Snowflake query history, BigQuery INFORMATION_SCHEMA views, and Databricks Unity Catalog all expose useful metadata that can be harvested for dependency analysis.
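Whatever tool does the aggregation, the core operation is the same: merging edge lists from each metadata source into one graph. A minimal sketch, with hypothetical edges standing in for what each source would emit:

```python
from collections import defaultdict

# Hypothetical (upstream, downstream) edges harvested from different sources.
dbt_edges = [
    ("stg_stripe__charges", "fct_orders"),
    ("fct_orders", "mart_customer_ltv"),
]
bi_edges = [("mart_customer_ltv", "dashboard.executive_revenue_overview")]
reverse_etl_edges = [("mart_customer_ltv", "reverse_etl.salesforce_account_health_sync")]

def merge_edges(*sources):
    """Union edge lists from multiple metadata sources into one adjacency map."""
    graph = defaultdict(set)
    for edges in sources:
        for upstream, downstream in edges:
            graph[upstream].add(downstream)
    return graph

graph = merge_edges(dbt_edges, bi_edges, reverse_etl_edges)
```

The hard part in practice is not the merge but entity resolution: the same table must get the same identifier whether it was seen by dbt, the warehouse query log, or the BI tool.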
Impact Analysis Before You Merge
The most valuable use of lineage is not post-hoc documentation. It is pre-merge impact analysis.
Before changing a model, ask:
- Which downstream tables depend on this model?
- Which dashboards read those tables?
- Are any reverse ETL jobs or ML features sourced from these fields?
- Is this a table-level change, a column-level change, or a semantic change with the same schema?
The schema-preserving semantic change is the dangerous one. Renaming a column usually fails loudly: downstream queries error out and the break is obvious. Changing the logic behind a familiar metric without changing the column name produces silent inconsistency instead. Good lineage paired with metric ownership makes that kind of change visible before it hits production.
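One pragmatic way to flag schema-preserving semantic changes in review automation is to compare a model's column set and its SQL text separately; a change to the text with no change to the columns deserves extra scrutiny. A toy sketch (function and inputs are illustrative):

```python
import hashlib

def classify_change(old_sql, new_sql, old_columns, new_columns):
    """Distinguish loud schema changes from silent same-schema logic changes."""
    if set(old_columns) != set(new_columns):
        return "schema change"      # loud: downstream references break
    if hashlib.sha256(old_sql.encode()).hexdigest() != \
       hashlib.sha256(new_sql.encode()).hexdigest():
        return "semantic change"    # silent: same columns, different logic
    return "no change"

kind = classify_change(
    "select amount as revenue from orders",
    "select amount - refund as revenue from orders",
    ["revenue"], ["revenue"],
)
# kind == "semantic change" -- route to the metric owner for review
```

A real implementation would hash compiled SQL (after macro expansion) rather than raw model text, so formatting-only edits do not trigger false alarms.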
# Example review checklist
Changed model: mart_customer_ltv
Downstream dependents:
- dashboard.executive_revenue_overview
- reverse_etl.salesforce_account_health_sync
- notebook.finance_forecast_q2
Columns changed semantically:
- customer_lifetime_value
- average_order_value
Required reviewers:
- Finance analytics owner
- RevOps owner
- Data platform reviewer
Lineage During Incidents
Incident response gets much faster when lineage is available. Without it, the investigation pattern is guesswork: inspect the dashboard, find the source table, inspect that table, guess which upstream transform might be wrong, repeat. With lineage, the search space narrows immediately.
A practical incident workflow looks like this:
- Start with the broken metric or dashboard.
- Traverse upstream to identify the immediate source model.
- Check recent code changes on that model and its parents.
- Compare row counts, freshness, and distribution shifts at each hop.
- If needed, traverse downstream from the root issue to identify all impacted assets for communication.
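The two traversals in that workflow, upstream for root cause and downstream for impact, are the same breadth-first walk with the edge direction flipped. A minimal sketch over a hypothetical edge list:

```python
from collections import defaultdict

# Hypothetical (upstream, downstream) lineage edges.
EDGES = [
    ("raw.stripe_charges", "stg_stripe__charges"),
    ("stg_stripe__charges", "fct_orders"),
    ("fct_orders", "mart_customer_ltv"),
    ("mart_customer_ltv", "dashboard.executive_revenue_overview"),
]

def traverse(edges, start, direction="upstream"):
    """Walk the lineage graph from start in the given direction."""
    step = defaultdict(set)
    for up, down in edges:
        if direction == "upstream":
            step[down].add(up)  # follow edges backwards
        else:
            step[up].add(down)  # follow edges forwards
    seen, frontier = set(), [start]
    while frontier:
        for nxt in step[frontier.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Root-cause hunt: everything upstream of the broken dashboard.
suspects = traverse(EDGES, "dashboard.executive_revenue_overview", "upstream")
# Blast radius: everything downstream of the offending model.
impacted = traverse(EDGES, "fct_orders", "downstream")
```

The `suspects` set bounds the investigation; the `impacted` set is your notification list.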
That last step matters. Good lineage is not just about finding the root cause. It is about knowing who to notify and what else may already be wrong.
Column-Level Lineage Is Hard, But Worth It
Column-level lineage is more expensive because SQL transformations are not always simple projections. A derived metric may be built from nested CTEs, window functions, macros, and UDFs. Parsing that reliably across dialects is not trivial.
That said, even partial column-level lineage is valuable. If your tools can identify direct field mappings and common transformations, you already gain much of the impact analysis benefit for common changes. Perfect lineage is not required for lineage to be useful.
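To make "partial" concrete: even a crude pass that extracts `source AS alias` pairs from a flat SELECT list captures the direct field mappings. A toy sketch below; a real implementation would use a dialect-aware parser such as sqlglot rather than regex, and this version deliberately ignores CTEs, functions, and expressions:

```python
import re

def direct_mappings(select_sql: str) -> dict[str, str]:
    """Map output alias -> source column for simple `x AS y` projections.

    Toy sketch only: handles a flat SELECT list, not nested queries,
    window functions, or expressions.
    """
    body = re.search(r"select\s+(.*?)\s+from", select_sql, re.I | re.S).group(1)
    mappings = {}
    for item in body.split(","):
        m = re.match(r"\s*([\w.]+)\s+as\s+(\w+)\s*$", item, re.I)
        if m:  # skip anything that is not a plain `column AS alias`
            mappings[m.group(2)] = m.group(1)
    return mappings
```

Skipping what it cannot parse is the point: the mappings it does recover are trustworthy, and the gaps are visible rather than silently wrong.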
The Cultural Shift
The most mature use of lineage is cultural, not technical. Teams start using it automatically in code reviews, onboarding, planning, and postmortems.
New engineers use lineage to understand the platform faster. Reviewers use it to ask sharper questions. Product and analytics stakeholders trust the data team more because changes are communicated with confidence instead of guesses.
The real payoff is not the graph itself. It is the operational habit it enables: nobody changes a model blindly, and nobody investigates a data incident from scratch.