For a decade, organisations ran two separate systems: a data lake that stored raw data cheaply, and a structured data warehouse for fast SQL. Data was typically copied into both, governed twice, and ended up as two competing versions of the truth. The lakehouse architecture claims to end that duality by combining the low cost of object storage with the reliability of a warehouse in a single layer.
The problem: the gulf between lake and warehouse
In the classic architecture, data flowed as follows: sources → data lake (Parquet/CSV, cheap but no ACID) → ETL → data warehouse (fast, ACID, but expensive and closed). The result was added latency, doubled cost, broken lineage and the 'which copy is right' debate.
Open table formats
The lakehouse is made possible by open table formats: Apache Iceberg, Delta Lake and Apache Hudi. They add a metadata layer over the Parquet files in object storage, which brings:
- ACID transactions: consistency across concurrent reads/writes.
- Time-travel: querying a past version of a table (critical for audit and error recovery).
- Schema evolution: adding/changing columns without rewriting the table.
- Partition evolution / hidden partitioning: changing the partition strategy without rewriting existing data or the queries that read it (support varies by format).
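To make the idea of a metadata layer more concrete, here is a minimal sketch in plain Python. It is a toy model, not how Iceberg, Delta Lake or Hudi are actually implemented: the names ToyTable and Snapshot and the file names are hypothetical, and the log lives in memory rather than in object storage. It only illustrates two of the properties above: a commit that succeeds atomically or is rejected on conflict, and time-travel by reading an older snapshot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed table version: an immutable list of data file paths."""
    version: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory metadata log (an assumption for illustration)."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, base_version: int, added_files: list) -> Snapshot:
        # Optimistic concurrency: the commit succeeds only if nobody else
        # committed since the writer read base_version.
        head = self.current()
        if base_version != head.version:
            raise RuntimeError(f"conflict: table changed since version {base_version}")
        new = Snapshot(head.version + 1, head.files + tuple(added_files))
        self.snapshots.append(new)
        return new

    def files_as_of(self, version: int) -> tuple:
        # Time-travel: resolve the file list of an older snapshot.
        return self.snapshots[version].files


table = ToyTable()
table.commit(0, ["part-0001.parquet"])
table.commit(1, ["part-0002.parquet"])

# A writer that still believes the table is at version 1 is rejected.
try:
    table.commit(1, ["part-0003.parquet"])
    conflict_detected = False
except RuntimeError:
    conflict_detected = True
```

Because data files are never modified in place, an old snapshot stays readable for as long as its files are retained, which is what makes audit and error recovery possible in this model.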
The medallion architecture
In a lakehouse, data usually moves through three quality layers:
- Bronze: raw data from the source, almost untouched.
- Silver: cleaned, deduplicated, joined data.
- Gold: business-ready aggregate and metric tables (BI and ML feed from here).
Every layer sits in the same open format; there is no separate system, only a different maturity level.
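The three layers can be sketched with plain Python lists standing in for tables. This is an illustration under assumptions, not a reference pipeline: the sample rows, the column names and the functions to_silver and to_gold are all hypothetical, and a real lakehouse would run the same steps in an engine such as Spark or through dbt.

```python
from collections import defaultdict

# Bronze: raw events as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": " tr ", "amount": "100.0"},
    {"order_id": "1", "country": " tr ", "amount": "100.0"},  # duplicate delivery
    {"order_id": "2", "country": "DE", "amount": "40.5"},
    {"order_id": "3", "country": "de", "amount": "bad"},      # unparseable amount
]


def to_silver(rows):
    """Cast types, normalise values, and drop duplicates and bad records."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row instead
        order_id = int(row["order_id"])
        if order_id in seen:
            continue
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate silver rows into a business metric: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is kept untouched, so the silver and gold steps can be re-run from it whenever a cleaning rule changes.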
Compute-storage separation
The economic advantage of the lakehouse is that compute and storage scale independently. Data sits cheaply in S3/ADLS/GCS; at query time, engines such as Spark, Trino or Dremio, or a platform such as Databricks, spin up compute temporarily. If nobody queries overnight and the compute is shut down, compute cost can drop to zero while only storage is billed; the 'always-on cluster' cost of a traditional warehouse disappears.
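The arithmetic behind that claim is simple enough to sketch. The hourly rate and the ten active hours per day below are hypothetical figures chosen for illustration, not vendor prices; storage cost is left out because it is the same in both scenarios.

```python
def monthly_compute_cost(hourly_rate: float, active_hours_per_day: float, days: int = 30) -> float:
    """Compute cost when you pay only for the hours the engine is running."""
    return hourly_rate * active_hours_per_day * days


HOURLY_RATE = 4.0  # hypothetical price per cluster-hour

always_on = monthly_compute_cost(HOURLY_RATE, 24)  # cluster never shuts down
on_demand = monthly_compute_cost(HOURLY_RATE, 10)  # compute runs 10 hours a day
saving = 1 - on_demand / always_on                 # fraction of compute cost avoided
```

Under these assumed numbers the on-demand setup avoids a little more than half of the compute bill; the real figure depends entirely on how many hours the workload keeps compute busy.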
Is the warehouse dead?
No. For very low-latency, high-concurrency BI workloads, managed warehouses like Snowflake/BigQuery still deliver more predictable performance. Many organisations go hybrid: lakehouse for raw + silver + ML, warehouse as the gold/BI serving layer. The fact that formats like Iceberg can be read by both Snowflake and Spark makes that hybrid easier.
Migration plan
A typical migration from an existing lake + warehouse to a lakehouse:
- Choose an open format (Iceberg or Delta, depending on your ecosystem).
- Move the bronze/silver layers from the lake into the lakehouse format.
- Define the gold layer with dbt or Spark; tie metrics to a semantic layer.
- Point BI tools at the gold tables first, then retire the warehouse gradually.
- Integrate time-travel and lineage into audit processes.
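Before retiring the warehouse, it helps to confirm that the new gold tables agree with the ones BI reads today. The sketch below is one possible parity check, written in plain Python over lists of rows. The function parity_report, the column names and the sample data are hypothetical; in practice the same comparison would usually run as SQL against both systems.

```python
def parity_report(warehouse_rows, lakehouse_rows, key, metric):
    """Compare row counts, key coverage and a summed metric between two copies."""
    wh_keys = {row[key] for row in warehouse_rows}
    lh_keys = {row[key] for row in lakehouse_rows}
    wh_total = sum(row[metric] for row in warehouse_rows)
    lh_total = sum(row[metric] for row in lakehouse_rows)
    return {
        "row_count_match": len(warehouse_rows) == len(lakehouse_rows),
        "missing_in_lakehouse": sorted(wh_keys - lh_keys),
        "metric_match": abs(wh_total - lh_total) < 1e-9,
    }


# Hypothetical gold table as it exists in the warehouse today.
warehouse = [
    {"country": "TR", "revenue": 100.0},
    {"country": "DE", "revenue": 40.5},
]

# A lakehouse copy where one country has not been migrated yet.
lakehouse_partial = [{"country": "TR", "revenue": 100.0}]

report_partial = parity_report(warehouse, lakehouse_partial, "country", "revenue")
report_full = parity_report(warehouse, list(warehouse), "country", "revenue")
```

A report with no missing keys and matching totals is a reasonable signal that a BI dashboard can be repointed; any mismatch is a reason to keep the warehouse table in service a little longer.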
Conclusion
The lakehouse uses open table formats to remove the decade-old choice between 'cheap but messy lake' and 'fast but expensive warehouse'. Thanks to ACID transactions, time-travel and schema evolution, a single copy of the data can feed both ML and BI. When it is built correctly, the double cost, the double copy and the 'which one is right' debate all end.
