Blog

Practical notes on data engineering, local-first tooling, and building systems that feel production-grade without the cloud bill.

Databricks for Data Engineers: What You Need to Know

March 20, 2026

A practical guide to Databricks: Delta Lake ACID transactions and time travel, Unity Catalog three-level namespacing and lineage, Structured Streaming with Auto Loader, SQL Warehouses for dbt, and when Databricks is the right choice versus a warehouse.

The Soft Skills That Make Data Engineers Irreplaceable

February 20, 2026

Technical skills get you hired. These determine how far you go: translating data concepts for business stakeholders without losing them, pushing back on bad requirements constructively, making invisible infrastructure work visible, and giving estimates that are actually reliable.

Writing SQL for Data Pipelines: Patterns That Scale

February 18, 2026

Pipeline SQL has different constraints than analytical SQL: it runs on a schedule, gets called with different inputs, and its failures are silent. CTE layering as a debugging affordance, window functions for sessionization and SCD Type 2, idempotent incremental patterns, and the anti-patterns that produce wrong answers quietly.

Data Engineering Skills That Actually Matter in 2026

February 15, 2026

What the data engineering job market actually rewards in 2026: SQL fluency that goes beyond syntax, system design reasoning that anticipates failure, and the difference between tools listed in job postings and skills actually probed in senior interviews.

dbt in Production: The Patterns That Scale

February 11, 2026

Advanced dbt patterns for projects that live past 50 models: source-scoped staging structure, custom schema generation for environments, macro libraries, slim CI with defer and state, and the testing philosophy that actually gets followed.

Data Lake Architecture: From Swamp to Lakehouse

February 9, 2026

How data lakes become swamps and how to prevent it: open table formats (Iceberg vs. Delta Lake), folder structure conventions, AWS Glue catalog, the lakehouse architecture pattern, and how to serve multiple compute engines from a single storage layer.

Async Python for Data Engineering: When and How to Use It

February 1, 2026

Fetching 500 API endpoints sequentially can take up to 500x longer than it needs to. A practical guide to async Python with asyncio and aiohttp: concurrent API ingestion, semaphore rate limiting, async database drivers, Dagster integration, and when async is the wrong tool.

Testing Data Pipelines with Python: A Practical Guide

January 28, 2026

Unit tests for transformation logic, integration tests with in-memory DuckDB, pytest fixtures for reusable test infrastructure, and property-based testing with Hypothesis. A practical framework for pipeline test coverage that actually catches bugs.

PostgreSQL for Data Engineers: Beyond Basic Queries

January 16, 2026

PostgreSQL patterns that matter for data engineering: window functions, CTEs, JSONB for semi-structured data, table partitioning, EXPLAIN ANALYZE for query diagnosis, and a clear view of when Postgres is the right tool versus when to reach for a dedicated warehouse.

dbt Exposures: Documenting Downstream Dependencies

January 7, 2026

Most dbt documentation stops at models. Exposures document the dashboards, notebooks, and applications that actually depend on them. This guide shows how to define exposures in YAML, use them for impact analysis before refactors, integrate them with catalogs, and roll them out without turning metadata into busywork.

Kafka Consumer Group Patterns for High-Throughput Pipelines

January 5, 2026

Consumer groups are where throughput dies: rebalances, hot partitions, lag cliffs, and sloppy commits. This post is a production playbook for scaling Kafka consumers with cooperative rebalancing, partition strategy, lag analysis, and commit discipline, plus Python patterns that survive real traffic.

Data Lineage and Catalog Tools: The Practical Comparison for 2026

January 4, 2026

Every data team eventually wants a data catalog. The practical decision framework: when dbt docs are enough, when open-source tools like DataHub or OpenMetadata are worth the operational overhead, when a commercial catalog like Atlan makes sense, and the mistake that makes every catalog useless -- not maintaining it.

Infrastructure as Code for Data Engineers: Terraform Patterns for Data Platforms

January 2, 2026

Data platform configuration accumulates drift. Someone creates a warehouse, forgets auto-suspend, and six months later it's still running. Terraform patterns for data engineers: Snowflake warehouses, schemas, and RBAC as code; S3 lifecycle policies; state management for teams; and the CI/CD workflow that makes infrastructure changes reviewable -- same discipline as application code.

API Design for Data Engineers: Building Reliable Data Ingestion Endpoints

December 31, 2025

Some APIs are a pleasure to pipeline. Others are a nightmare. A data engineer's perspective on what makes the difference: cursor vs. offset pagination (offset is fragile for live data), rate limit handling with Retry-After, idempotency keys for safe retries, webhook reliability patterns, and the incremental API design that reduces ingestion cost by orders of magnitude.

Spark vs. dbt: When to Use Each for Large-Scale Data Transformations

December 26, 2025

Spark or dbt? Usually both -- applied to the workloads each handles best. A practical decision framework: dbt for SQL-expressible analytics modeling with lineage and docs; Spark for procedural logic, petabyte-scale economics, ML features, and streaming. The architecture that stitches both into a coherent platform, the common mistakes, and the PySpark write pattern that makes dbt consumption predictable.

BigQuery for Data Engineers: Architecture, Optimization, and When to Use It

December 24, 2025

BigQuery's on-demand pricing charges per byte scanned, not per compute second -- which changes how you optimize. A data engineer's guide to BigQuery: serverless architecture vs. Snowflake, partition + cluster strategies for cost control, require_partition_filter as a safeguard, dbt configuration, BigQuery-specific SQL patterns (STRUCT, UNNEST, MERGE), and when to choose BigQuery vs. Snowflake.

Data Governance in Practice: The Parts That Actually Work

December 22, 2025

Most governance programs fail because they require separate effort with no immediate engineering benefit. A practical guide to governance that actually sticks: ownership in dbt YAML, role-based access control with column masking, lineage from manifest.json, a data dictionary that auto-updates with every pipeline run, and PII classification at ingestion.

Data Platform Cost Optimization: Reducing Cloud Spend Without Sacrificing Reliability

December 21, 2025

Data platform costs scale faster than teams expect. The specific levers that move the needle: cost visibility queries before you optimize anything, warehouse auto-suspend and resource monitors, Time Travel storage tuning, the query patterns that generate disproportionate cost, dbt incremental vs. table materialization impact, and how to build cost culture before the bill surprises you.

Data Engineering Interview Questions: What Senior Roles Actually Ask

December 14, 2025

The interview questions that actually separate senior DE candidates: system design probes (what questions you ask before proposing architecture), pipeline failure debugging scenarios, SQL edge cases that reveal production instincts, behavioral questions about pushing back on stakeholders, and the questions you should ask the interviewer.

Apache Airflow in Production: Lessons from Running It at Scale

December 12, 2025

The Airflow docs teach you to write a DAG. They don't explain why your scheduler is crawling, why tasks zombie, or what retry config prevents 3 AM incidents. Hard-won lessons: scheduler bottlenecks, operator selection (PythonOperator vs. KubernetesPodOperator), safe retry patterns, XCom limits, secrets management, and the workflows where Airflow is the wrong tool entirely.

Getting to Senior Data Engineer: The Skills Interviewers Actually Test

December 10, 2025

Every DE above junior claims the same tools. What actually separates senior from mid-level in interviews: system design thinking that starts with trade-offs, production mindset (idempotency, explicit failure modes, volume alerts), the ability to say no intelligently, and how to tell technical stories that lead with business impact.

dbt Best Practices for Senior Data Engineers: Beyond the Tutorial

November 30, 2025

The dbt patterns that separate senior engineers from analysts who learned it last year. Project structure decisions that compound, incremental model strategy that won't blow up in production, data contracts for cross-team governance, and the meta-skills -- naming discipline, deprecation, documentation as design review -- that keep a project maintainable at scale.

Data Observability: How to Know When Your Pipeline Is Lying to You

November 28, 2025

A pipeline that fails loudly is easy to fix. The dangerous one succeeds silently while delivering wrong data. Practical guide to data observability: freshness checks, volume anomaly detection, schema change alerts, lineage tracking, and tool selection -- so you find problems before your stakeholders do.