Data Governance in Practice: The Parts That Actually Work
Ryan Kirsch · December 22, 2025 · 9 min read
Most data governance frameworks fail for the same reason: they are designed as compliance exercises rather than engineering infrastructure. They create documentation that goes stale, approval processes that slow teams down, and ownership assignments that no one enforces. The governance that actually works treats accountability, access control, and documentation as natural outputs of building the platform well.
The Problem with Traditional Governance
Traditional data governance programs typically involve: a data catalog that requires manual entry, an ownership matrix that lives in a spreadsheet, an approval process for data access that takes days, and a data dictionary that was accurate when written and has not been touched since.
None of these mechanisms are self-maintaining. They all require human effort to keep current, and that effort competes with the engineering work that produces business value. When the choice is between updating a data dictionary and building the next pipeline, the dictionary loses every time.
The alternative: build governance into the tools and workflows that engineers use every day. If owning a table means adding a meta block to a dbt YAML file that is already being edited, ownership gets maintained. If documenting a model means writing the description before the PR is merged, documentation gets written. Make the right behavior the path of least resistance.
Data Ownership: Making It Real
Data ownership is only meaningful if it comes with responsibilities. A table that has an “owner” field in a spreadsheet but no actual accountability is not owned -- it is labeled. Real ownership means someone is on-call for that table, reviews schema change requests, and is notified when quality checks fail.
The dbt implementation that makes ownership functional:
```yaml
# models/marts/finance/fct_monthly_revenue.yml
models:
  - name: fct_monthly_revenue
    description: >
      Monthly revenue aggregated from delivered orders.
      One row per calendar month. Source of truth for Finance
      team MRR dashboard and board reporting.
    meta:
      owner: "data-platform@company.com"
      team: "data-platform"
      tier: "tier-1"  # Tier 1 = business-critical, monitored 24/7
      consumers:
        - "finance-dashboard"
        - "board-reporting-pipeline"
        - "monthly-close-process"
      sla:
        freshness_hours: 6  # Must refresh within 6h of schedule
        uptime_pct: 99.5
      pii: false
      data_classification: "internal"
    columns:
      - name: month
        description: "First day of the calendar month (e.g., 2026-03-01)"
        data_type: date
      - name: revenue_usd
        description: "Sum of delivered order amounts, net of refunds, in USD"
        data_type: numeric
```

With this structure in dbt, ownership is version-controlled, searchable, and co-located with the model it describes. When you run dbt docs generate, the owner and consumer information is automatically included in the catalog. No separate spreadsheet required.
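Because the metadata is compiled into dbt's manifest.json, enforcing it is a short script away. A minimal sketch that flags models missing an owner, using the standard manifest node structure (the inline manifest fragment stands in for `target/manifest.json` and is illustrative):

```python
def models_missing_owner(manifest: dict) -> list[str]:
    """Return names of models whose meta block lacks an owner.

    Expects the dict loaded from dbt's target/manifest.json.
    """
    missing = []
    for node_id, node in manifest.get("nodes", {}).items():
        if not node_id.startswith("model."):
            continue  # skip tests, seeds, snapshots
        # meta may live under config.meta (newer dbt) or top-level meta
        meta = node.get("config", {}).get("meta", {}) or node.get("meta", {})
        if not meta.get("owner"):
            missing.append(node["name"])
    return missing

# Inline fragment standing in for json.load(open("target/manifest.json"))
manifest = {
    "nodes": {
        "model.myproject.fct_monthly_revenue": {
            "name": "fct_monthly_revenue",
            "meta": {"owner": "data-platform@company.com"},
        },
        "model.myproject.stg_orders": {"name": "stg_orders", "meta": {}},
        "test.myproject.not_null_x": {"name": "not_null_x"},
    }
}
print(models_missing_owner(manifest))  # ['stg_orders']
```

Run this in CI and unowned models never reach production in the first place.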
Access Control: Role-Based and Enforced
Access control is governance that engineering teams take seriously, because the consequences of getting it wrong are visible (security incidents, compliance failures) rather than invisible (stale documentation). The implementation in Snowflake:
```sql
-- Role hierarchy for data access control
-- ACCOUNTADMIN
--   SYSADMIN
--     DATA_PLATFORM_ADMIN (data engineers)
--       ANALYTICS_WRITE   (data engineers: write to silver/gold)
--         ANALYTICS_READ  (analysts: read prod data)
--     STAGING_WRITE       (ingestion pipelines: write to bronze/staging)

-- Create role hierarchy
CREATE ROLE IF NOT EXISTS DATA_PLATFORM_ADMIN;
CREATE ROLE IF NOT EXISTS ANALYTICS_READ;
CREATE ROLE IF NOT EXISTS ANALYTICS_WRITE;
CREATE ROLE IF NOT EXISTS STAGING_WRITE;

GRANT ROLE ANALYTICS_READ TO ROLE ANALYTICS_WRITE;
GRANT ROLE ANALYTICS_WRITE TO ROLE DATA_PLATFORM_ADMIN;
GRANT ROLE DATA_PLATFORM_ADMIN TO ROLE SYSADMIN;

-- Grant schema-level access
GRANT USAGE ON DATABASE ANALYTICS TO ROLE ANALYTICS_READ;
GRANT USAGE ON SCHEMA ANALYTICS.GOLD TO ROLE ANALYTICS_READ;
GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.GOLD TO ROLE ANALYTICS_READ;
GRANT SELECT ON FUTURE TABLES IN SCHEMA ANALYTICS.GOLD TO ROLE ANALYTICS_READ;

-- Column masking for PII (Snowflake Enterprise+)
CREATE OR REPLACE MASKING POLICY pii_mask AS (val STRING)
RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('DATA_PLATFORM_ADMIN') THEN val
        ELSE '****'
    END;

ALTER TABLE dim_customers
    MODIFY COLUMN email SET MASKING POLICY pii_mask;
```

The key design decision: grant access at the role level, not the user level. Adding a new analyst means assigning them the ANALYTICS_READ role -- one change, all the right permissions, no one-off grants that are forgotten and accumulate.
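One way to keep that discipline honest is to periodically diff actual role membership against an approved mapping. A minimal sketch, assuming the grant records have already been exported (e.g., from Snowflake's SHOW GRANTS OF ROLE output) into dicts; the approved mapping and user names are illustrative assumptions:

```python
# Approved role -> users mapping; in practice this would live in
# version control next to the role definitions (illustrative values).
APPROVED = {
    "ANALYTICS_READ": {"alice", "bob"},
    "DATA_PLATFORM_ADMIN": {"carol"},
}

def unapproved_grants(grants: list[dict]) -> list[tuple[str, str]]:
    """Return (role, user) pairs not present in the approved mapping."""
    flagged = []
    for g in grants:
        role, user = g["role"], g["grantee_name"]
        if user not in APPROVED.get(role, set()):
            flagged.append((role, user))
    return flagged

# Records mimicking rows exported from SHOW GRANTS OF ROLE <role>
grants = [
    {"role": "ANALYTICS_READ", "grantee_name": "alice"},
    {"role": "DATA_PLATFORM_ADMIN", "grantee_name": "dave"},  # not approved
]
print(unapproved_grants(grants))  # [('DATA_PLATFORM_ADMIN', 'dave')]
```

Anything flagged is either a grant to revoke or a mapping entry to add via PR -- either way the drift becomes visible.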
Lineage: The Governance That Pays for Itself
Data lineage -- knowing what data came from where and what depends on it -- is the governance feature that pays for itself most directly. When a source system changes, lineage tells you exactly which downstream models are affected. When a table has a quality issue, lineage tells you which dashboards are consuming it.
dbt generates lineage automatically from its ref() and source() calls. The output is a directed acyclic graph that can be queried:
```python
# Query dbt manifest for upstream dependencies of a model
import json

def get_upstream_models(manifest_path: str, model_name: str) -> list[str]:
    """Returns all models that model_name depends on, recursively."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    target = f"model.myproject.{model_name}"
    visited = set()
    queue = [target]
    while queue:
        current = queue.pop(0)
        if current in visited:
            continue
        visited.add(current)
        node = manifest.get("nodes", {}).get(current, {})
        parents = node.get("depends_on", {}).get("nodes", [])
        queue.extend(parents)
    visited.discard(target)
    # depends_on can also list sources and seeds; keep models only
    return [n.split(".")[-1] for n in visited if n.startswith("model.")]

# Usage: find everything fct_monthly_revenue depends on
upstream = get_upstream_models("target/manifest.json", "fct_monthly_revenue")
print(f"fct_monthly_revenue depends on {len(upstream)} upstream models")
```

Combine lineage with ownership metadata and you have a complete impact analysis tool: for any source system change, identify all affected models and automatically notify their owners.
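The combination can be sketched from the same manifest: walk the graph in the other direction using the child_map dbt records, and collect each affected model's declared owner. The child_map and nodes keys are standard manifest structure; the inline fragment and owner values are illustrative:

```python
from collections import deque

def impacted_owners(manifest: dict, changed_node: str) -> dict[str, str]:
    """Map each model downstream of changed_node to its declared owner."""
    child_map = manifest.get("child_map", {})
    nodes = manifest.get("nodes", {})
    affected: dict[str, str] = {}
    seen = set()
    queue = deque(child_map.get(changed_node, []))
    while queue:
        cur = queue.popleft()
        if cur in seen:
            continue
        seen.add(cur)
        queue.extend(child_map.get(cur, []))
        if cur.startswith("model."):
            meta = nodes.get(cur, {}).get("meta", {})
            affected[nodes[cur]["name"]] = meta.get("owner", "unowned")
    return affected

# Inline fragment standing in for target/manifest.json (owners illustrative)
manifest = {
    "child_map": {
        "source.myproject.stripe.charges": ["model.myproject.stg_charges"],
        "model.myproject.stg_charges": ["model.myproject.fct_monthly_revenue"],
        "model.myproject.fct_monthly_revenue": [],
    },
    "nodes": {
        "model.myproject.stg_charges": {"name": "stg_charges", "meta": {}},
        "model.myproject.fct_monthly_revenue": {
            "name": "fct_monthly_revenue",
            "meta": {"owner": "data-platform@company.com"},
        },
    },
}
print(impacted_owners(manifest, "source.myproject.stripe.charges"))
```

Every downstream model resolves to an owner (or to "unowned", which is itself a finding), so a source change becomes a list of people to notify.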
A Data Dictionary That Stays Current
The fastest path to a current data dictionary is to use the documentation already living in your dbt YAML files. dbt docs generates a static site from your model descriptions, column descriptions, and test definitions. It is automatically regenerated with every dbt run, which means it is never more than one pipeline run out of date.
```shell
# Generate and serve docs locally
dbt docs generate
dbt docs serve --port 8080

# Deploy docs to a static host (in CI/CD)
dbt docs generate
aws s3 sync target/ s3://company-data-docs/ --delete  # docs site is written to target/ by default
# Or: upload to Netlify, Vercel, GitHub Pages

# The output includes:
# - Model descriptions with owners and SLAs
# - Column descriptions and data types
# - Test coverage per model
# - Interactive lineage graph
# - Source system documentation
```

The enforcement mechanism: make adding a description mandatory in the dbt CI check before a model can be merged to main. A PR that adds a new production model without a description fails CI. This is governance that requires no human reviewer -- the tool enforces it automatically.
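That CI check is a few lines against the compiled manifest. A hedged sketch -- the description field is standard dbt manifest structure, while the inline fragment and the exit convention are illustrative:

```python
def undocumented_models(manifest: dict) -> list[str]:
    """List models with an empty or missing description."""
    return sorted(
        node["name"]
        for node_id, node in manifest.get("nodes", {}).items()
        if node_id.startswith("model.") and not node.get("description", "").strip()
    )

# In CI this would be: manifest = json.load(open("target/manifest.json"))
manifest = {
    "nodes": {
        "model.myproject.fct_monthly_revenue": {
            "name": "fct_monthly_revenue",
            "description": "Monthly revenue aggregated from delivered orders.",
        },
        "model.myproject.stg_orders": {"name": "stg_orders", "description": ""},
    }
}
missing = undocumented_models(manifest)
if missing:
    print(f"Models missing descriptions: {missing}")
    # in CI: raise SystemExit(1) here to fail the job
```

Run after `dbt compile` in the PR pipeline and undocumented models simply cannot merge.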
PII and Data Classification
PII handling is the governance area where mistakes have the most serious consequences. The practical approach:
- Classify at ingestion. Tag PII columns in the staging layer YAML before they enter the transformation pipeline. This makes PII status visible to every model that references the column.
- Never join PII into gold tables unnecessarily. A gold model that contains customer email addresses when it only needs customer segments has unnecessarily expanded the PII surface. Use customer IDs in gold and join to PII only at the serving layer, where access control is explicit.
- Audit PII access quarterly. Who has SELECT access to tables with PII? Are all those grants still necessary? This audit is simple to run and catches access grants that were provisioned for a project and never removed.
```sql
-- Quarterly PII access audit query (Snowflake INFORMATION_SCHEMA)
SELECT
    g.grantee,
    g.privilege_type,
    g.object_name,
    g.object_schema
FROM INFORMATION_SCHEMA.OBJECT_PRIVILEGES g
WHERE g.privilege_type = 'SELECT'
  AND g.object_name IN (
      -- Tables with PII columns
      SELECT table_name
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE comment ILIKE '%pii%'
         OR column_name IN ('email', 'phone', 'ssn', 'dob', 'address')
  )
ORDER BY g.grantee, g.object_name;
```

The Governance That Engineering Teams Will Actually Follow
The common thread in all of these practices: they produce governance artifacts as side effects of work that engineers are already doing, rather than requiring separate effort.
- Writing a dbt model requires a YAML file -- put ownership there
- Merging a PR can require a description -- enforce it in CI
- dbt runs generate docs -- deploy them automatically
- Granting access requires a role -- make the role hierarchy explicit so every grant is deliberate
The governance programs that fail are the ones that require engineers to do extra work with no immediate benefit to their pipelines. The ones that succeed are woven into the tools and processes that engineers already trust.
Ryan Kirsch
Senior Data Engineer with experience building production pipelines at scale. Works with dbt, Snowflake, and Dagster, and writes about data engineering patterns from production experience. See his full portfolio.