
Apache Iceberg for Data Engineers: The Complete Guide to Open Table Formats in 2026

Ryan Kirsch · November 19, 2025 · 12 min read

I build systems that have to survive years of schema changes, engine swaps, and backfills that touch petabytes. Apache Iceberg is the table format that made those systems feel stable. This guide is the hands on version of how I use Iceberg in production, with code, migration patterns, and the tradeoffs that actually matter when you are on call.

What Apache Iceberg Is and Why It Matters

Apache Iceberg is an open table format designed for massive analytic datasets. It sits between your compute engines and your object storage and gives you ACID transactions, versioned metadata, and a consistent table contract across Spark, Flink, Trino, and more. I treat it like a database layer for the data lake. It keeps raw files simple while making table operations safe, repeatable, and auditable.

The reason it matters is operational. Without a table format, Parquet on object storage is just files with no shared transaction log. Iceberg adds that log and enforces snapshots, so readers get a consistent view even when multiple writers are active. That changes how confidently you can iterate on models, add partitions, or run backfills.

The Core Problems Iceberg Solves

I use Iceberg for four reasons that show up daily in production.

  • ACID transactions on the data lake. Writers commit atomically, and readers see consistent snapshots instead of partial updates.
  • Schema evolution without downtime. You can add, rename, or reorder columns and keep historical snapshots readable.
  • Time travel for debugging and audits. Snapshots let you query what a report used at a specific timestamp or snapshot ID.
  • Partition evolution without rewriting data. You can change partitioning strategy over time and keep old files valid.

These are not abstract features. They are the difference between a calm on call rotation and a midnight fire drill when a backfill lands.
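To make the ACID point concrete, Iceberg supports row-level operations like MERGE INTO through Spark SQL, and each statement commits as one atomic snapshot. A minimal sketch, using hypothetical `lakehouse.events` and `lakehouse.events_staging` tables:

```sql
-- Atomic upsert: readers see either the old snapshot or the new one, never a mix.
-- events_staging is a hypothetical staging table holding the incoming batch.
MERGE INTO lakehouse.events t
USING lakehouse.events_staging s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

If the job dies mid-write, no new snapshot is committed and readers never see the partial batch.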

Iceberg vs Delta Lake vs Hudi

All three formats solve similar problems, but the tradeoffs are real. I use this comparison when I am helping teams pick a standard.

| Capability | Iceberg | Delta Lake | Hudi |
| --- | --- | --- | --- |
| Governance model | Open, vendor neutral | Open core with strong Databricks gravity | Open, driven by streaming use cases |
| Engine support | Broad across Spark, Flink, Trino, Athena | Best in Spark and Databricks | Strong in Spark, improving elsewhere |
| Schema evolution | Full evolution with rename and type changes | Good, especially in Spark | Available, but less consistent across engines |
| Time travel | Snapshot based | Version based | Commit based |
| Community momentum | Strong, cross vendor and foundation led | Strong, centered on Databricks ecosystem | Strong, led by streaming and CDC teams |

If you are deep in Databricks, Delta Lake is the smoothest operational choice. If your system revolves around streaming upserts, Hudi has advantages. If you want an open format with broad engine support and clean separation between compute and storage, Iceberg is the default I reach for.

Hands On With PySpark

These are the exact patterns I use for day to day Iceberg work. The examples assume a Spark session configured with an Iceberg catalog.

# Create a database and an Iceberg table
spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse")

spark.sql("""
CREATE TABLE IF NOT EXISTS lakehouse.events (
  event_id STRING,
  user_id STRING,
  event_name STRING,
  event_ts TIMESTAMP,
  ingest_date DATE
)
USING iceberg
PARTITIONED BY (days(ingest_date))
""")
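Writes go through ordinary SQL or the DataFrame API, and each statement commits as a new snapshot. A minimal append against the table above, with illustrative values:

```sql
-- Each INSERT commits atomically as a new snapshot
INSERT INTO lakehouse.events
VALUES ('e-1001', 'u-42', 'page_view',
        TIMESTAMP '2026-03-10 07:59:12', DATE '2026-03-10');
```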

Schema evolution is one of the reasons I move off raw Parquet. Add a new column without breaking old readers.

# Add a column and backfill it later
spark.sql("ALTER TABLE lakehouse.events ADD COLUMN source STRING")

# Rename a column safely
spark.sql("ALTER TABLE lakehouse.events RENAME COLUMN event_name TO event_type")
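The new source column can be backfilled later with a row-level UPDATE, which Iceberg supports in Spark SQL. A sketch, assuming 'legacy' is the value you want for pre-existing rows:

```sql
-- Backfill the new column for existing rows; commits as a single snapshot
UPDATE lakehouse.events
SET source = 'legacy'
WHERE source IS NULL;
```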

Time travel is how I debug reports. I query the snapshot that existed when a dashboard ran and compare it to the current view.

-- Query a snapshot by timestamp
SELECT *
FROM lakehouse.events
FOR SYSTEM_TIME AS OF TIMESTAMP '2026-03-10 08:00:00';

-- Query by snapshot ID
SELECT *
FROM lakehouse.events
FOR SYSTEM_VERSION AS OF 952310148750533;
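Snapshot IDs come from the table's metadata tables, and a bad commit can be rolled back with a stored procedure. A sketch, assuming the Iceberg catalog is named `my_catalog`:

```sql
-- List snapshots to find IDs and commit times
SELECT snapshot_id, committed_at, operation
FROM lakehouse.events.snapshots;

-- Roll the table back to a known-good snapshot
CALL my_catalog.system.rollback_to_snapshot('lakehouse.events', 952310148750533);
```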

Integration Patterns That Work

I keep the table format consistent and let each engine do what it is best at. These are the patterns I see succeed.

  • dbt for transformations. Use a Spark, Trino, or Athena adapter that writes Iceberg tables directly and keep models versioned in git.
  • Spark for batch ingestion and heavy transforms. It is still the most flexible engine for large rewrite jobs and compaction.
  • Flink for streaming upserts and near real time pipelines. Iceberg handles the snapshot commits while Flink manages the state.
  • Trino for interactive analytics and federated queries. It lets you join Iceberg with other sources without copying data.

The key is to standardize on one catalog, then enforce a single table contract across engines. That keeps lineage clean and avoids data drift.
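For the dbt pattern, a model can target Iceberg directly through adapter configuration. A sketch using dbt-spark's `file_format` option with a hypothetical model and source; option names vary by adapter, so check your adapter's docs:

```sql
-- models/events_daily.sql (hypothetical dbt model)
{{ config(
    materialized='incremental',
    file_format='iceberg',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

select event_id, user_id, event_type, event_ts
from {{ source('lakehouse', 'events') }}
```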

Real World Migration From Parquet

I recently migrated a multi petabyte Parquet lake to Iceberg. The core idea was to avoid a single big bang. We introduced an Iceberg catalog, then converted the highest value tables first. New writes went to Iceberg, and we used copy on write to materialize the same data while keeping legacy jobs alive.

The migration workflow was consistent. Create the Iceberg table with the current schema, backfill partitions in parallel, then flip readers to the new table snapshot by snapshot. We validated row counts and sample hashes during the cutover. The biggest win was removing the custom manifest code we had built to track files. Iceberg replaced that logic with a metadata layer that every engine could read.
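Iceberg ships Spark procedures that cover much of this workflow: `snapshot` creates an Iceberg table over existing Parquet files without moving them, so legacy jobs keep working during the cutover. A sketch, assuming a catalog named `my_catalog` and a hypothetical legacy table:

```sql
-- Create an Iceberg table that references the existing Parquet files in place
CALL my_catalog.system.snapshot('legacy_db.events_parquet', 'lakehouse.events_iceberg');

-- Validate the cutover with a simple row-count comparison
SELECT
  (SELECT count(*) FROM legacy_db.events_parquet) AS parquet_rows,
  (SELECT count(*) FROM lakehouse.events_iceberg) AS iceberg_rows;
```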

Decision Framework: Iceberg vs Delta Lake

I use a simple framework when teams ask which format to standardize on. If you are already all in on Databricks and Spark, Delta Lake is often the fastest path to stability. If you need open governance, a portable catalog, and multiple engines, Iceberg is the safer long term bet.

The tipping points are practical. Choose Iceberg when you plan to use Trino or Flink at scale, when you need time travel across engines, or when you want to avoid vendor lock in. Choose Delta Lake when the team is Spark first, the platform is Databricks centered, and you want the smoothest managed experience.

Production Tips I Actually Use

  • Enforce a single catalog. Multiple catalogs for the same table format create drift and break lineage.
  • Schedule compaction. Small files kill performance, and compaction is the lowest effort optimization you can automate.
  • Track snapshot retention policies. Keep enough history for debugging, but expire old snapshots to control metadata growth.
  • Treat partition evolution as a planned change. Update partition specs intentionally, then validate the query plan across engines.
  • Test time travel. Run a weekly check that queries a prior snapshot and validates counts so audits do not fail when you need them most.
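The compaction and retention tips above map to two built-in Spark procedures. A sketch, assuming a catalog named `my_catalog` and an illustrative retention cutoff:

```sql
-- Compact small files into larger ones
CALL my_catalog.system.rewrite_data_files(table => 'lakehouse.events');

-- Expire snapshots older than the retention window to control metadata growth
CALL my_catalog.system.expire_snapshots(
  table => 'lakehouse.events',
  older_than => TIMESTAMP '2026-02-01 00:00:00'
);
```

Scheduling both on a cron alongside the weekly time travel check covers the bulk of routine Iceberg maintenance.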

Bottom Line

Iceberg is the table format I trust for large scale analytics. It gives me a stable contract across engines, safe schema evolution, and a clear history of every write. If you are building a data lake that has to survive years of change, Iceberg is not a nice to have, it is the foundation.


Ryan Kirsch

Data Engineer at the Philadelphia Inquirer. Writing about practical data engineering, local-first stacks, and systems that scale without a cloud bill.
