Cloud Architecture

Lakehouse: Uniting the Data Lake and Warehouse in One Layer

For years the data lake and the data warehouse were two separate worlds, and the same data was kept in both. The lakehouse collapses that duality into a single layer built on open table formats.

BIART Ekibi
Figure: lakehouse architecture and the medallion layers

For a decade, organisations ran two separate systems: a data lake that stored raw data cheaply, and a structured data warehouse that served fast SQL. Data was usually copied twice and governed twice, and the two copies often disagreed, producing two versions of the truth. The lakehouse architecture aims to end that duality by bringing the low cost of object storage and the reliability of a warehouse into one layer.

The problem: the gulf between lake and warehouse

In the classic architecture, data flowed as follows: sources → data lake (Parquet/CSV files; cheap, but no ACID guarantees) → ETL → data warehouse (fast and ACID-compliant, but expensive and closed). The result was latency, double cost, broken lineage and the 'which copy is right' debate.

Open table formats

The technologies that make the lakehouse possible are open table formats: Apache Iceberg, Delta Lake and Apache Hudi. They add a metadata layer over the Parquet files in object storage, and that layer brings:

  • ACID transactions: consistency across concurrent reads/writes.
  • Time-travel: querying a past version of a table (critical for audit and error recovery).
  • Schema evolution: adding/changing columns without rewriting the table.
  • Partition evolution / hidden partitioning: changing the partition strategy without rewriting existing data or the queries that read it (these two features are most closely associated with Iceberg).
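
To make the idea of a metadata layer concrete, here is a minimal sketch in plain Python. It is not the API of Iceberg, Delta Lake or Hudi; `ToyTable`, `Snapshot` and `CommitConflict` are hypothetical names invented for illustration. It assumes a table is simply an append-only log of immutable snapshots, each listing the data files and columns valid at that version, which is enough to show atomic commits, time-travel and schema evolution in miniature.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a schema plus the data files it points to."""
    version: int
    columns: tuple   # column names valid for this version
    files: tuple     # immutable data files (stand-ins for Parquet files)


class CommitConflict(Exception):
    """Raised when a writer tries to commit against a stale table version."""


@dataclass
class ToyTable:
    """Hypothetical metadata layer: an append-only log of snapshots."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ("id",), ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def snapshot(self, version: int) -> Snapshot:
        # Time-travel: old snapshots are never modified, so any version stays readable.
        return self.snapshots[version]

    def commit(self, base_version, add_files=(), add_columns=()):
        # Optimistic concurrency: succeed only if nobody else has committed
        # since this writer read `base_version`.
        head = self.current()
        if base_version != head.version:
            raise CommitConflict(f"expected v{head.version}, got v{base_version}")
        new = Snapshot(
            version=head.version + 1,
            columns=head.columns + tuple(add_columns),  # schema evolution: metadata only
            files=head.files + tuple(add_files),        # existing files are not rewritten
        )
        self.snapshots.append(new)
        return new


table = ToyTable()
table.commit(0, add_files=["part-000.parquet"])
table.commit(1, add_files=["part-001.parquet"], add_columns=["country"])

print(table.current().columns)   # ('id', 'country')
print(table.snapshot(1).files)   # ('part-000.parquet',)
```

A second writer that still holds version 1 and calls `table.commit(1, ...)` gets a `CommitConflict` instead of silently overwriting version 2. Real formats implement the same idea with an atomic swap of a metadata pointer or an ordered transaction log.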

The medallion architecture

In a lakehouse, data usually moves through three quality layers:

  1. Bronze: raw data from the source, almost untouched.
  2. Silver: cleaned, deduplicated, joined data.
  3. Gold: business-ready aggregate and metric tables (BI and ML feed from here).

Every layer sits in the same open format; there is no separate system, only a different maturity level.
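
The three layers can be sketched as a small pipeline. The example below uses plain Python lists in place of real tables, and the sample orders, the column names and the `to_silver` / `to_gold` helpers are all hypothetical. In practice each step would typically be a Spark or dbt job writing to a table in the chosen open format.

```python
from collections import defaultdict

# Bronze: raw events as they arrived (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "tr", "amount": "100.0"},
    {"order_id": "1", "country": "tr", "amount": "100.0"},   # duplicate delivery
    {"order_id": "2", "country": "DE ", "amount": "40.5"},
    {"order_id": "3", "country": "de", "amount": None},      # missing amount
    {"order_id": "4", "country": "TR", "amount": "10.0"},
]


def to_silver(rows):
    """Clean and deduplicate: typed columns, normalised keys, one row per order."""
    seen, silver = set(), []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(rows):
    """Aggregate to a business-ready metric: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)   # {'TR': 110.0, 'DE': 40.5}
```

Bronze is left untouched, so silver and gold can always be rebuilt from it if a cleaning rule changes.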

Compute-storage separation

The economic advantage of the lakehouse is that compute and storage scale independently. Data sits cheaply in S3/ADLS/GCS; at query time, engines like Spark, Trino, Dremio or Databricks spin up compute temporarily. If nobody queries overnight and the compute is shut down or auto-suspended, compute cost for those hours drops to zero, so the 'always-on cluster' cost of a traditional warehouse disappears.
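
The arithmetic behind that claim is simple enough to sketch. The hourly rate and the number of active hours below are hypothetical figures chosen only to make the comparison visible; real pricing depends on the provider, the engine and the cluster size.

```python
def monthly_compute_cost(hourly_rate: float, active_hours_per_day: float, days: int = 30) -> float:
    """Compute cost when the cluster is billed only while it is running."""
    return hourly_rate * active_hours_per_day * days


# All figures are hypothetical.
HOURLY_RATE = 4.0   # cost of one cluster-hour

always_on = monthly_compute_cost(HOURLY_RATE, active_hours_per_day=24)
on_demand = monthly_compute_cost(HOURLY_RATE, active_hours_per_day=9)

print(always_on)                   # 2880.0
print(on_demand)                   # 1080.0
print(1 - on_demand / always_on)   # 0.625
```

Under these assumptions, a cluster that runs only during nine working hours costs 62.5% less than one that never shuts down. Storage cost is the same in both cases and is left out.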

Is the warehouse dead?

No. For very low-latency, high-concurrency BI workloads, managed warehouses like Snowflake/BigQuery still deliver more predictable performance. Many organisations go hybrid: lakehouse for raw + silver + ML, warehouse as the gold/BI serving layer. The fact that formats like Iceberg can be read by both Snowflake and Spark makes that hybrid easier.

Migration plan

A typical migration from an existing lake + warehouse to a lakehouse:

  1. Choose an open format (Iceberg or Delta, depending on your ecosystem).
  2. Move the bronze/silver layers from the lake into the lakehouse format.
  3. Define the gold layer with dbt or Spark; tie metrics to a semantic layer.
  4. Point BI tools at the gold tables first, then retire the warehouse gradually.
  5. Integrate time-travel and lineage into audit processes.
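
Before the warehouse is retired in step 4, it can help to check that the new gold tables agree with what the warehouse currently reports. The plan above does not prescribe such a check, so the sketch below is an assumption: `reconcile`, the sample rows and the `month` / `revenue` columns are hypothetical, and in practice the two inputs would come from queries against each system.

```python
import math


def reconcile(warehouse_rows, lakehouse_rows, key, measure, rel_tol=1e-9):
    """Return human-readable differences between two copies of a metric table."""
    wh = {r[key]: r[measure] for r in warehouse_rows}
    lh = {r[key]: r[measure] for r in lakehouse_rows}
    problems = []
    for k in sorted(wh.keys() | lh.keys()):
        if k not in lh:
            problems.append(f"{k}: missing in lakehouse")
        elif k not in wh:
            problems.append(f"{k}: missing in warehouse")
        elif not math.isclose(wh[k], lh[k], rel_tol=rel_tol):
            problems.append(f"{k}: {wh[k]} != {lh[k]}")
    return problems


# Hypothetical monthly revenue as reported by each system.
warehouse = [
    {"month": "2024-01", "revenue": 1200.0},
    {"month": "2024-02", "revenue": 950.0},
    {"month": "2024-03", "revenue": 1010.0},
]
lakehouse_gold = [
    {"month": "2024-01", "revenue": 1200.0},
    {"month": "2024-02", "revenue": 940.0},   # drifted value
]

for problem in reconcile(warehouse, lakehouse_gold, key="month", measure="revenue"):
    print(problem)
# 2024-02: 950.0 != 940.0
# 2024-03: missing in lakehouse
```

An empty result for every gold table over an agreed period is a reasonable signal that a BI dashboard can be switched over.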

Conclusion

The lakehouse uses open table formats to remove the decade-old choice between a 'cheap but messy lake' and a 'fast but expensive warehouse'. Thanks to ACID, time-travel and schema evolution, a single copy of the data feeds both ML and BI. When it is built correctly, the double cost, the double copy and the 'which one is right' debate all end.
