Data Lakehouse Architecture: When to Use It and How to Build One
The lakehouse pattern combines the scalability of a data lake with the reliability guarantees of a warehouse. Here is when that tradeoff actually makes sense and how to build it well.
I have built data platforms on both ends of the spectrum: tightly coupled cloud warehouses where everything runs through Snowflake, and loosely coupled lake architectures where Spark jobs write Parquet to S3 and hope for the best. The lakehouse pattern sits between those two extremes, and after working with it in production for a couple of years, my view is that it is genuinely the right default for a certain class of data platform.
The core idea is straightforward: store data in open formats on cheap object storage (S3, GCS, or Azure Data Lake), then apply a table format layer on top that gives you ACID transactions, schema evolution, time travel, and efficient query planning. The result is something that behaves like a warehouse from the query perspective but has the economics and flexibility of a lake.
The Three Open Table Formats: Delta Lake, Iceberg, and Hudi
The table format layer is where most teams get stuck, because there are three credible options and the differences between them are real but not always obvious from documentation alone.
Delta Lake is the Databricks-originated format and is deeply integrated into the Databricks runtime. If your team is already on Databricks, Delta Lake is the default choice and the right one. The tooling, documentation, and community support are excellent. Outside of Databricks, Delta Lake has improved significantly with the Delta Standalone and Delta Kernel projects, but it historically felt most native in that ecosystem.
Apache Iceberg has become the format I reach for on AWS and multi-engine environments. Iceberg has strong support across Spark, Flink, Trino, DuckDB, and most cloud-managed services (AWS Glue, Athena, and Redshift all support it natively). The catalog abstraction in Iceberg is particularly clean: it supports REST catalogs, Glue catalogs, and Hive Metastore, which matters when you need multiple compute engines reading the same tables. I have built Iceberg-based pipelines where Spark writes, Flink reads for real-time aggregations, and DuckDB queries for local development, all against the same table definition. That interoperability is hard to match.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) originated at Uber and has strong CDC and streaming upsert capabilities. If your primary use case is near-real-time CDC from operational databases into your lake, Hudi deserves serious consideration. The Copy-on-Write vs. Merge-on-Read configuration gives you explicit control over the read/write performance tradeoff. Outside of that specific use case, Hudi has a steeper operational curve than Iceberg or Delta Lake.
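To make that control concrete, here is a sketch of the Spark datasource write options that select Hudi's table type. The option keys are Hudi's standard write options; the table and field names are hypothetical, and the choice of MERGE_ON_READ is just for illustration:

```python
# Hudi Spark datasource write options (sketch; table and field names are
# hypothetical). table.type selects the tradeoff explicitly:
#   COPY_ON_WRITE  - rewrites files on update (fast reads, slower writes)
#   MERGE_ON_READ  - appends deltas, merges at read time (the reverse)
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}
# Applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```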
For most greenfield projects in 2026, my recommendation is Iceberg on AWS or GCP, Delta Lake on Databricks, and Hudi when CDC is the dominant workload.
Medallion Architecture: Bronze, Silver, Gold
The medallion architecture is the standard layering pattern for lakehouse data. I have seen it called different things (raw/cleansed/curated is common), but the structure is the same. Bronze is raw ingested data. Silver is cleansed and conformed. Gold is business-ready, query-optimized data for analytics and reporting.
Bronze is append-only and schema-on-read. You write exactly what you received from the source, including any weirdness. This is your insurance policy: if your transformation logic is wrong, you can reprocess from bronze without touching the source system again.
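A minimal sketch of that contract in plain Python (the field names are my own convention, not a standard): store the payload byte-for-byte, attach lineage metadata, and defer all parsing to silver.

```python
import hashlib
from datetime import datetime, timezone

def to_bronze_record(raw_payload: bytes, source: str) -> dict:
    """Wrap a raw source payload for append-only bronze storage.

    The payload is kept exactly as received, weirdness included;
    parsing, casting, and cleanup happen later, in silver.
    """
    return {
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # A content hash makes reprocessing and dedup audits cheap.
        "payload_sha256": hashlib.sha256(raw_payload).hexdigest(),
        "payload": raw_payload.decode("utf-8", errors="replace"),
    }

# Note the bad "amount" value is preserved as-is, not rejected.
record = to_bronze_record(b'{"event_id": 1, "amount": "oops"}', source="orders-api")
```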
Silver is where most of the transformation work happens: null handling, deduplication, type casting, joins to reference data, and application of business rules. Silver tables should have explicit schemas, dbt tests, and freshness SLAs. This is the layer that downstream teams depend on.
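The deduplication rule, for instance, is usually just "keep the newest version of each key." A plain-Python sketch (column names are illustrative) makes the intent explicit before it gets encoded in a MERGE statement:

```python
def dedupe_latest(records: list[dict], key: str = "event_id",
                  version: str = "updated_at") -> list[dict]:
    """Keep only the newest record per key - a common silver dedup rule."""
    latest: dict = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec[version] > latest[k][version]:
            latest[k] = rec
    # Sort by key so the output is deterministic.
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"event_id": 1, "updated_at": "2026-01-01", "status": "pending"},
    {"event_id": 1, "updated_at": "2026-01-03", "status": "shipped"},
    {"event_id": 2, "updated_at": "2026-01-02", "status": "pending"},
]
deduped = dedupe_latest(rows)
```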
Gold is optimized for consumption: aggregated fact tables, pre-joined dimensions, materialized metrics. Gold tables are often the most expensive to produce and should only exist for workloads that are actually query-heavy. I have seen teams create gold tables for every possible view of data and then watch their pipeline costs triple for no real benefit. Be deliberate about what goes in gold.
One pattern I have found valuable: separate gold layers by domain. The analytics gold layer has different SLAs and optimization choices than the gold layer feeding a reverse ETL sync to Salesforce. Treating them as the same layer creates contention and brittle dependencies.
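For a sense of what belongs in gold versus silver, here is the kind of rollup a gold table materializes, sketched in plain Python over already-cleaned silver rows (names are illustrative):

```python
from collections import defaultdict

def daily_revenue(silver_rows: list[dict]) -> list[dict]:
    """Aggregate cleaned silver events into a gold daily-revenue table."""
    totals: dict = defaultdict(int)
    for row in silver_rows:
        totals[row["event_date"]] += row["revenue_cents"]
    # Convert cents to dollars at the consumption layer.
    return [
        {"event_date": d, "revenue": cents / 100.0}
        for d, cents in sorted(totals.items())
    ]

gold = daily_revenue([
    {"event_date": "2026-01-01", "revenue_cents": 1250},
    {"event_date": "2026-01-01", "revenue_cents": 750},
    {"event_date": "2026-01-02", "revenue_cents": 500},
])
```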
When a Lakehouse Beats a Data Warehouse
A cloud warehouse (Snowflake, BigQuery, Redshift) is still the right default for many teams. The operational burden is low, the SQL experience is excellent, and for teams under a few terabytes with primarily SQL workloads, the cost is often competitive. I am not arguing against warehouses. I am arguing for being deliberate about the choice.
A lakehouse is the better choice when one or more of these conditions apply:
- Your data volumes are large enough that per-byte warehouse costs become a significant budget item
- You have multiple compute engines that need to read the same data (Spark for ML feature engineering, SQL for analytics, Flink for streaming aggregations)
- You need to retain raw data indefinitely for compliance or reprocessing, and warehouse storage costs are prohibitive
- Your team has machine learning workloads that need access to unstructured or semi-structured data alongside tabular data
- You are building on AWS and want to avoid cloud vendor lock-in on your primary data storage
The teams that struggle with lakehouses are usually the ones that underestimated the operational overhead. You own the table compaction, the vacuum jobs, the catalog management, and the access control in a way you do not with a managed warehouse. If your team does not have dedicated data engineering capacity, start with a warehouse and move toward lakehouse patterns as scale demands it.
Hands-On: DuckDB and Spark Against Iceberg Tables
One of the most useful development patterns I have found is using DuckDB locally to query the same Iceberg tables that Spark writes in the pipeline. This removes the need for a running cluster during development and makes iteration dramatically faster.
-- Query an Iceberg table locally with DuckDB
-- Requires: INSTALL iceberg; LOAD iceberg;
SELECT
    event_date,
    COUNT(*) AS event_count,
    SUM(revenue_cents) / 100.0 AS revenue
FROM iceberg_scan(
    's3://my-bucket/silver/events/',
    allow_moved_paths = true
)
WHERE event_date >= current_date - INTERVAL 7 DAYS
GROUP BY 1
ORDER BY 1 DESC;
For Spark writes, the pattern I default to for silver-layer upserts:
from pyspark.sql import SparkSession
# Register an Iceberg catalog named "glue_catalog", backed by AWS Glue
spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/warehouse/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .getOrCreate()
# Merge upsert into silver layer
spark.sql("""
MERGE INTO glue_catalog.silver.events t
USING (SELECT * FROM new_events_staging) s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
# Compact small files after write
spark.sql("""
CALL glue_catalog.system.rewrite_data_files(
table => 'silver.events',
strategy => 'binpack',
options => map('target-file-size-bytes', '134217728')  -- 128 MB target files
)
""")The rewrite_data_files call is important and easy to forget. Iceberg tables accumulate small files over time from incremental writes. Compaction keeps read performance acceptable and reduces the metadata overhead that slows down query planning on large tables.
The Operational Reality
I want to be direct about the operational side, because the lakehouse pitch often glosses over it. Running a lakehouse well requires ongoing maintenance: table compaction, orphan file cleanup, snapshot expiration, and catalog management. None of it is particularly difficult, but it needs to be scheduled and monitored. I run these as Dagster assets on a weekly cadence, with asset checks that alert if compaction has not run or if table sizes have grown anomalously.
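As a sketch of what those scheduled jobs execute: the procedure names below are Iceberg's built-in Spark procedures, while the catalog name and retention window are assumptions you would adjust per table. I generate the statements and run each with spark.sql in the scheduled job:

```python
from datetime import datetime, timedelta, timezone

def iceberg_maintenance_sql(catalog: str, table: str,
                            retain_days: int = 7) -> list[str]:
    """Build the periodic maintenance statements for one Iceberg table.

    Uses Iceberg's Spark procedures; catalog name and retention window
    are illustrative defaults, not recommendations for every table.
    """
    cutoff = (datetime.now(timezone.utc)
              - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Expire old snapshots: bounds metadata growth, at the cost of
        # losing time travel past the retention window.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # Delete files no snapshot references (e.g. from failed writes).
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]

stmts = iceberg_maintenance_sql("glue_catalog", "silver.events")
# In the scheduled job: for s in stmts: spark.sql(s)
```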
Access control is also more complex than a managed warehouse. You are managing S3 bucket policies, IAM roles, and potentially Ranger or Lake Formation policies depending on your setup. If your team has strong security and compliance requirements, factor in the engineering time to implement that correctly.
None of this is a reason to avoid the lakehouse pattern. It is a reason to plan for it. The cost savings and flexibility at scale are real. The operational overhead is also real, and the teams that treat it as an afterthought end up with data swamps rather than lakehouses.