Delta Lake and the Lakehouse Architecture: What Every Data Engineer Needs to Know
October 26, 2025 · 8 min read
The old tradeoff between data warehouses and data lakes is dead. We used to pick between reliability and flexibility. Warehouses gave us clean schemas, updates, and predictable query performance. Lakes gave us cheap storage, open formats, and the ability to land raw data without a schema fight. The lakehouse pattern ends that argument. You can have both, and you can do it without paying the lock-in tax.
Delta Lake is the key enabler. It is an open table format built on top of Parquet that adds ACID transactions, versioning, and schema enforcement to a plain data lake. It is not a separate database; it is a transaction layer plus a log. At The Philadelphia Inquirer, that combination turned our S3 lake into something analysts trusted and engineers could evolve without constant backfills.
What Makes Delta Lake Different
Delta Lake solves the part of data lakes that keeps on-call engineers up at night: correctness. It brings ACID transactions to object storage like S3, GCS, and ADLS so you can treat the lake like a real database. You get consistent reads and writes, even with multiple jobs running at the same time.
Time travel is the feature I use most in practice. You can query a table at a specific version or timestamp using VERSION AS OF or TIMESTAMP AS OF. That makes debugging bad data and auditing changes possible without an expensive restore. Delta also enforces schemas and supports schema evolution, which means your pipelines do not explode the first time an upstream team adds a column.
You also get real DML operations on the lake: UPDATE, DELETE, and MERGE for upserts. Under the hood, Delta uses optimistic concurrency control, so concurrent writers do not corrupt state. If two jobs collide, one of them retries instead of producing a half-written table.
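As a minimal sketch of what that DML looks like through the Delta Python API (the table path and predicates here are illustrative, not from a real pipeline):

```python
from delta.tables import DeltaTable

# Hypothetical table path; any Delta table works the same way.
accounts = DeltaTable.forPath(spark, "s3://lake/accounts/")

# DELETE: drop rows matching a predicate in one atomic commit.
accounts.delete("status = 'deleted'")

# UPDATE: rewrite only the files that contain matching rows.
# Note the set values are SQL expression strings, hence the inner quotes.
accounts.update(
    condition="plan = 'legacy'",
    set={"plan": "'basic'"}
)
```

Each call is a single commit in the transaction log, which is what makes the retry-on-conflict behavior safe.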
The MERGE Pattern (Upsert)
The most common pattern in real data engineering is CDC, not full rewrites. If you are processing Kafka events into a Delta table, you want to insert new rows, update existing ones, and leave everything else untouched. A full rewrite can work for small tables, but it gets expensive once you are at tens or hundreds of millions of rows. It also makes backfills painful because you have to load the entire table for a single change.
Here is the pattern I use for a typical Kafka CDC pipeline. You read a micro-batch, dedupe on event time, then merge into the target table:
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# latest event per key in this batch
windowed = events.withColumn(
    "row_rank",
    F.row_number().over(
        Window.partitionBy("account_id").orderBy(F.desc("event_ts"))
    )
).filter("row_rank = 1").drop("row_rank")
delta_table = DeltaTable.forPath(spark, "s3://lake/accounts/")

delta_table.alias("target").merge(
    source=windowed.alias("src"),
    condition="target.account_id = src.account_id"
).whenMatchedUpdate(set={
    "status": "src.status",
    "plan": "src.plan",
    "updated_at": "src.event_ts",
}).whenNotMatchedInsert(values={
    "account_id": "src.account_id",
    "status": "src.status",
    "plan": "src.plan",
    "created_at": "src.event_ts",
    "updated_at": "src.event_ts",
}).execute()

This beats a full table rewrite when your updates are a small slice of the overall table, which is the common case in CDC. You pay the cost of rewriting only the files that contain updated rows, and you keep history for free. The merge pattern also scales naturally with streaming ingestion. If your input is a Kafka topic, you can run this as a structured streaming job and commit micro-batches into Delta with consistent state.
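When the same merge needs to run continuously, Structured Streaming's foreachBatch hook lets you reuse the batch MERGE logic per micro-batch. This is a sketch, assuming a streaming DataFrame already parsed from Kafka and placeholder paths for the table and checkpoint:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def upsert_batch(batch_df, batch_id):
    """Dedupe one micro-batch, then MERGE it into the target table."""
    latest = (
        batch_df.withColumn(
            "row_rank",
            F.row_number().over(
                Window.partitionBy("account_id").orderBy(F.desc("event_ts"))
            ),
        )
        .filter("row_rank = 1")
        .drop("row_rank")
    )
    (
        DeltaTable.forPath(spark, "s3://lake/accounts/")
        .alias("target")
        .merge(latest.alias("src"), "target.account_id = src.account_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# events_stream is assumed to be a streaming DataFrame built with
# spark.readStream.format("kafka") and parsed into typed columns.
(
    events_stream.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://lake/_checkpoints/accounts/")
    .start()
)
```

The checkpoint plus Delta's atomic commits is what gives you exactly-once table state even if the job restarts mid-batch.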
Time Travel in Practice
Time travel looks like a party trick until you are debugging a bad join at 2 a.m. and need to see what the table looked like yesterday. In Delta, every commit is a version in the transaction log. You can read any version back:
df = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/path/to/table")

The use cases are practical, not theoretical. Debugging bad data is the obvious one. Auditing pipeline changes is another. I have also used time travel to roll back a table after a bad deployment. The Delta log tracks every change as an atomic commit with metadata. It is the reason this system works. You are not mutating a table in place; you are appending a new version and letting the log point to the correct files.
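Rolling back is the same idea in reverse. Recent Delta releases expose restore directly in the Python API; this is a sketch assuming delta-spark 1.2 or later and an illustrative table path:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "s3://lake/accounts/")

# Roll back to a known-good version. This writes a *new* commit that
# points back at the old files, so the bad versions stay in history.
table.restoreToVersion(5)

# Or restore to a point in time instead of a version number.
table.restoreToTimestamp("2025-10-25")
```

Because restore is itself a commit, the rollback is auditable and can be rolled back again if needed.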
Schema Evolution
Schema enforcement is one of the most underrated features in Delta Lake. It prevents silent corruption when upstream data shifts. If a field changes type, Delta will stop the write. That is annoying in the short term and life-saving in the long term.
When you do want evolution, you can opt in. The mergeSchema option lets you add columns safely, which is the common case. Breaking changes still require a deliberate migration. This is the right tradeoff. You should not be able to accidentally change a column from string to struct and keep writing.
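Opting in looks like this, a minimal sketch with a placeholder path. mergeSchema adds new columns on write; incompatible type changes still fail, which is the behavior you want:

```python
# new_events has an extra column the target table does not have yet.
(
    new_events.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # allow additive columns on this write
    .save("s3://lake/accounts/")
)
```

Existing rows get null for the new column, and the table schema is updated in the same atomic commit as the data.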
Delta Lake vs Iceberg vs Hudi
These three formats are the modern table layer for data lakes. I have used all of them, and the decision usually comes down to your stack and your team.
Delta Lake has the tightest Spark integration and feels most natural in Databricks. The APIs are clean, the docs are solid, and you can move fast without surprises. At The Philadelphia Inquirer, we chose Delta because we were already running Spark jobs and needed reliability more than vendor neutrality.
Iceberg is the most vendor-neutral choice. It has strong support in AWS Glue, Trino, and other query engines, which makes it great if you are building a multi-engine platform. Hudi shines in streaming-heavy pipelines on S3, especially CDC with low latency. If I were starting a pure AWS stack today, I would take a hard look at Iceberg. If I needed streaming merges at scale on S3, I would lean Hudi. For Spark-focused teams, Delta is still the smoothest path.
Production Tips
Delta in production looks simple until you run it for months. The big wins come from file management. Use OPTIMIZE for file compaction so you do not drown in small files. Use Z-ORDER to colocate columns you filter on most. Run VACUUM to clean up old files, and keep a retention window of 7 days or more so time travel is still meaningful.
If you are on Databricks, enable auto optimize and auto compaction. It saves you from a whole class of operational issues. These are not optional features in long lived tables. They are the difference between a lakehouse that performs and a swamp that slowly decays.
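The maintenance commands themselves are plain SQL you can run from a scheduled job. This sketch uses an illustrative table path, a seven-day retention window, and the Databricks-specific auto optimize table properties:

```python
# Compact small files and colocate the hottest filter column.
spark.sql("OPTIMIZE delta.`s3://lake/accounts/` ZORDER BY (account_id)")

# Clean up files no longer referenced by the log. 168 hours = 7 days,
# which keeps a week of time travel intact.
spark.sql("VACUUM delta.`s3://lake/accounts/` RETAIN 168 HOURS")

# Databricks-specific: write better-sized files and compact automatically.
spark.sql("""
    ALTER TABLE delta.`s3://lake/accounts/` SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
```

A weekly OPTIMIZE plus a VACUUM with a retention window at or above your time travel needs covers most tables.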
Closing
The lakehouse is the standard now. Delta Lake is battle-tested, and it solves the problems that made data lakes unreliable in the first place. If you are interviewing for a senior data engineer role, learn it deeply. Know how merges work, know how the log works, and know the operational knobs that keep tables healthy. Then go build something real with it. You will feel the difference after the first week.
Questions or pushback on any of this? Find me on LinkedIn.
Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.