Apache Iceberg and the Open Lakehouse: Why Every Data Engineer Needs to Know It in 2026
November 2, 2025 · 9 min read
The modern data stack is converging on a simple idea: store data in open formats on object storage, and use a transaction layer that makes it behave like a real database. Apache Iceberg is that layer. It is an open table format built for huge analytic datasets, with a metadata model that makes updates, deletes, and time travel reliable at scale.
If you are building a lakehouse in 2026, Iceberg is not just a nice to have. It is one of the most important pieces of the architecture. It gives you consistency without locking you into a single vendor, and it works across Spark, Flink, Trino, and the fast growing list of modern engines.
What Apache Iceberg Is and Why It Matters
Iceberg is a table format for data lakes. It stores data in Parquet, ORC, or Avro files, and uses metadata files to track snapshots, partitioning, and schema history. The core idea is simple: data files are immutable, and table state is defined by a snapshot that points to a list of files. Every write creates a new snapshot, so reads are consistent, and you can go back in time without copying data.
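To make that concrete, here is a toy sketch of the snapshot model in plain Python. It is illustrative only, nothing like Iceberg's actual metadata layout, but it shows why immutable files plus snapshot pointers give you consistent reads and free time travel:

```python
# Toy model of the core idea (illustrative, not Iceberg's real format):
# data files are immutable, and each commit appends a new snapshot that
# points to the full list of files making up the table at that moment.
table = {"snapshots": [], "current": None}

def commit(table, files_added):
    # Start from the current snapshot's file list; never mutate it.
    if table["current"] is None:
        parent = []
    else:
        parent = table["snapshots"][table["current"]]
    table["snapshots"].append(parent + files_added)
    table["current"] = len(table["snapshots"]) - 1

commit(table, ["data-001.parquet"])
commit(table, ["data-002.parquet"])

# A reader pinned to snapshot 0 still sees the old state: time travel.
assert table["snapshots"][0] == ["data-001.parquet"]
assert table["snapshots"][1] == ["data-001.parquet", "data-002.parquet"]
```

Because old snapshots are never rewritten, "going back in time" is just reading from an older pointer, which is exactly why it requires no data copying.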
This matters because data lakes without a table format are fragile. They break when schemas drift, when jobs overlap, or when you need to delete rows. Iceberg makes the lake behave like a database while keeping the lake open. You get ACID semantics, predictable query planning, and a path to multi engine analytics without moving your data.
Key Features You Actually Feel in Production
Iceberg's features are not abstract; they show up in daily operations. Schema evolution means you can add columns safely without breaking downstream jobs. You can rename columns and change types with explicit rules, and the schema history is tracked so old snapshots still read correctly.
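In Spark SQL, that evolution is plain DDL. A sketch with an assumed table name and a hypothetical quantity column, written as strings you would pass to spark.sql(...) in a session with the Iceberg extensions enabled:

```python
# Assumed table name; the quantity column is hypothetical.
rename_sql = (
    "ALTER TABLE prod.analytics.orders "
    "RENAME COLUMN order_total TO order_amount"
)
# Type changes are restricted to safe widenings such as int -> bigint
# and float -> double, so existing data files stay readable.
widen_sql = (
    "ALTER TABLE prod.analytics.orders "
    "ALTER COLUMN quantity TYPE bigint"
)
# spark.sql(rename_sql); spark.sql(widen_sql)  # inside a Spark session
```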
Time travel and snapshot isolation are the safety net. Every write creates a snapshot, and readers see a consistent view of the table even while writers commit. That makes debugging and auditing much simpler. You can query a snapshot by ID or timestamp and validate exactly what a report used last week.
Partition evolution and hidden partitioning are the most underrated wins. Iceberg stores partition values in metadata, not in the file path. You can change partitioning strategy without rewriting the table, and you avoid leaking partition logic into every query. That saves both compute and human time.
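The transforms behind hidden partitioning are easy to picture. Here is an illustrative pure Python version of a days-style transform, not Iceberg's actual implementation: the engine derives the partition value from the column itself, so queries filter on the timestamp and never mention a partition column:

```python
from datetime import datetime, timezone

# Illustrative sketch of a days() partition transform: the partition
# value is derived from the column (days since the Unix epoch) and
# stored in metadata, not encoded in file paths.
def days_transform(ts: datetime) -> int:
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

# Two events on the same day land in the same partition...
assert days_transform(
    datetime(2026, 1, 15, 3, 0, tzinfo=timezone.utc)
) == days_transform(datetime(2026, 1, 15, 23, 0, tzinfo=timezone.utc))
# ...and the value is just a day count.
assert days_transform(datetime(1970, 1, 2, tzinfo=timezone.utc)) == 1
```

Because queries never reference the derived value directly, the table can later switch to, say, an hourly transform without breaking a single downstream query.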
Iceberg vs Delta Lake vs Hudi
These formats solve the same class of problems, but they optimize for different tradeoffs. Here is the short comparison I use when advising teams:
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Engine support | Broad, vendor neutral | Best in Spark and Databricks | Strong in Spark, evolving elsewhere |
| Schema evolution | Yes, with rename and type changes | Yes, with merge and alter options | Yes, but less uniform across engines |
| Time travel | Snapshot based | Version based | Commit based |
| Partition evolution | First class and flexible | Supported, but path based in many cases | Supported with more operational tuning |
| Streaming upserts | Good, improving quickly | Good, especially on Databricks | Excellent, core strength |
| Catalog options | Hive, Glue, Nessie, REST | Hive and proprietary catalogs | Hive, Glue, and others |
If you are deep in Spark and Databricks, Delta Lake is still the smoothest operational experience. If you need streaming upserts at low latency, Hudi has real advantages. If you need vendor neutrality and a clean separation between compute and storage, Iceberg is the default answer.
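One note on upserts in Iceberg specifically: row level changes usually go through Spark SQL's MERGE INTO. A sketch with assumed table and source names, written as a string you would pass to spark.sql(...):

```python
# Assumed names: prod.analytics.orders is the target table, and
# "updates" is a registered view or table holding the changed rows.
merge_sql = """
MERGE INTO prod.analytics.orders AS t
USING updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""
# spark.sql(merge_sql)  # inside a Spark session with Iceberg extensions
```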
Integration With the Modern Stack
Iceberg is not tied to a single engine. Spark has first class support, and you can read and write tables using the Iceberg catalog APIs. Flink supports Iceberg for streaming and batch, which is useful for near real time pipelines. Trino and Presto allow fast interactive queries across Iceberg tables without moving data.
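Attaching Spark to an Iceberg catalog comes down to a handful of session settings. A sketch assuming a Hive backed catalog named prod; the warehouse path is a placeholder:

```python
# Iceberg's documented Spark settings; the catalog name ("prod"),
# catalog type, and warehouse path are assumptions for this example.
iceberg_conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.prod": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.prod.type": "hive",  # or "hadoop" / "rest"
    "spark.sql.catalog.prod.warehouse": "s3://my-bucket/warehouse",
}
# Apply with SparkSession.builder.config(k, v) for each pair.
```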
dbt integrates through adapters and can manage Iceberg models in platforms like Spark, Trino, and Athena. Snowflake supports Iceberg tables in an open catalog and can query them without copying into internal storage. BigQuery supports Iceberg external tables on GCS, which is a major shift for teams already invested in GCP.
The Open Lakehouse Architecture
The open lakehouse pattern is simple and powerful. Store data in object storage like S3 or GCS, define tables with Iceberg, and let compute engines attach through a shared catalog. Your data lives in open files, your metadata lives in open tables, and your compute can change over time.
Iceberg is the foundation because it separates storage from compute without losing correctness. You can run Spark for batch, Flink for streaming, Trino for BI, and let each engine see the same table state. That is the difference between a lakehouse and a pile of Parquet files.
Production Patterns That Keep Iceberg Healthy
The first pattern is compaction. Iceberg tables can accumulate small files from streaming and micro batch jobs. Run periodic compaction jobs to rewrite small files into larger ones, and avoid slow metadata scans. Many teams run nightly compaction or trigger it when file counts exceed a threshold.
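The compaction itself is a single call to Iceberg's built in Spark procedure. A sketch with an assumed catalog name and a roughly 512 MB target file size:

```python
# rewrite_data_files is Iceberg's compaction procedure; the catalog
# name ("prod") and target size are assumptions for this example.
compact_sql = """
CALL prod.system.rewrite_data_files(
  table => 'analytics.orders',
  options => map('target-file-size-bytes', '536870912')
)
"""
# spark.sql(compact_sql)  # run on a schedule or a file-count trigger
```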
The second pattern is metadata management. Iceberg uses manifests and metadata files, and these can grow over time. Use snapshot expiration and manifest rewrite operations to keep metadata efficient. You should also track metrics like file count, manifest count, and metadata size as part of operational monitoring.
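Both maintenance operations are also Spark procedures. A sketch with an assumed catalog name, retention timestamp, and snapshot count:

```python
# expire_snapshots drops old snapshots and the files only they
# reference; rewrite_manifests consolidates metadata. Catalog name
# and retention values are assumptions for this example.
expire_sql = """
CALL prod.system.expire_snapshots(
  table => 'analytics.orders',
  older_than => TIMESTAMP '2026-01-01 00:00:00',
  retain_last => 10
)
"""
rewrite_manifests_sql = (
    "CALL prod.system.rewrite_manifests(table => 'analytics.orders')"
)
# spark.sql(expire_sql); spark.sql(rewrite_manifests_sql)
```

Note that expiring snapshots trades away time travel range for smaller metadata, so set the retention window to match how far back your audits actually need to go.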
The third pattern is catalog choice. Hive Metastore is common for self managed clusters. AWS Glue is the default on AWS. Nessie adds Git like branching and tagging on metadata, which is great for data lifecycle control. The REST catalog is the emerging standard for tool interoperability and managed services. Pick the catalog that matches your governance and multi engine needs.
When to Use Iceberg and When Not To
Use Iceberg when you need a vendor neutral table format, when you want multiple engines to read the same data, or when you care about long term ownership of your data architecture. It shines for append heavy analytics, slowly changing dimensions, and large datasets that need reliable evolution.
Do not use Iceberg if you need ultra low latency streaming upserts at extreme scale and you are already committed to a Hudi stack. Do not use it if you want a fully managed warehouse and do not care about open storage. A lakehouse is not always the answer if your data is small and your queries are simple.
PySpark Examples
The API surface is clean and familiar. Here is how to create an Iceberg table in Spark using a catalog:
```python
spark.sql("""
CREATE TABLE prod.analytics.orders (
    order_id STRING,
    customer_id STRING,
    order_total DOUBLE,
    order_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_ts))
""")
```

Time travel is snapshot based. You can query by snapshot ID or timestamp:
```python
df = spark.read.format("iceberg") \
    .option("snapshot-id", 4928382749921) \
    .load("prod.analytics.orders")
```

Schema evolution is explicit and safe. This example adds a column without rewriting the full dataset:
```python
spark.sql("""
ALTER TABLE prod.analytics.orders
ADD COLUMN order_channel STRING
""")
```

Closing
The open lakehouse is here, and Iceberg is one of its most important building blocks. It gives you reliability, portability, and room to grow as your stack changes. If you are a data engineer in 2026, learn Iceberg deeply. It will show up in interviews, architecture reviews, and real world pipelines. More importantly, it will make your lakehouse feel like a database without giving up the benefits of open storage.
Questions or pushback on any of this? Find me on LinkedIn.
Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.