1What is the primary problem that lakehouse table formats (Delta Lake, Iceberg, Hudi) solve?
A data lake stores raw, unstructured, and structured data at any scale in open file formats on cheap object storage. The lakehouse architecture adds ACID transactions, schema enforcement, and time travel on top of data lakes using table formats like Delta Lake, Apache Iceberg, and Apache Hudi, bridging the gap between data lakes and data warehouses.
The **data lake** emerged as a reaction to the rigid schemas and high costs of traditional data warehouses. Instead of requiring data to conform to a predefined schema before loading (schema-on-write), a data lake stores raw data in its native format -- CSV, JSON, Parquet, images, logs -- on cheap object storage like S3, GCS, or ADLS. Any team can write any data to the lake without upfront schema design. Analytics engines (Spark, Presto, Athena) apply schema at read time (schema-on-read), enabling flexible exploration.
However, data lakes quickly became 'data swamps.' Without transactions, concurrent writes could corrupt data. Without schema enforcement, columns drifted across files. Without versioning, mistakes were irreversible. Without indexing, queries scanned entire directories. The promise of a universal data store collided with the reality of unreliable data.
The **lakehouse** architecture solves these problems by adding a transactional metadata layer on top of the data lake. Three open-source table formats compete: **Delta Lake** (Databricks, uses a JSON/Parquet transaction log), **Apache Iceberg** (Netflix, uses manifest files and snapshot metadata), and **Apache Hudi** (Uber, optimized for upserts and incremental processing). All three provide ACID transactions, schema evolution, partition evolution, time travel (query data as of a past timestamp), and efficient upsert/delete operations -- all while storing data in open Parquet files on object storage.
The lakehouse eliminates the need for a separate data warehouse for many workloads. A single copy of data on S3, managed by Iceberg, can serve batch ETL (Spark), interactive SQL queries (Trino/StarRocks), ML training (Ray/PyTorch reading Parquet directly), and streaming ingestion (Flink writing Iceberg commits). This 'single copy, multiple engines' approach reduces data duplication, storage costs, and ETL complexity.
Lakehouse for a Ride-Sharing Platform
A ride-sharing company stores all trip data in Apache Iceberg tables on S3. Raw trip events are streamed from Kafka into an Iceberg table via Flink, committed every minute. The table is partitioned by date and city. The analytics team queries trip data using Trino for ad-hoc SQL analysis. The ML team reads the same Iceberg table directly from PyTorch for demand forecasting model training. The finance team uses Spark batch jobs to compute driver payments. When the ML team realizes they need data from last month for a retraining experiment, they use Iceberg's time travel: SELECT * FROM trips FOR TIMESTAMP AS OF '2026-05-01'. One copy of data, three engines, full ACID guarantees.
Netflix
Netflix created Apache Iceberg to manage their 100+ PB data lake on S3. Their previous Hive-based approach had severe limitations: partition pruning required users to know the physical layout, adding a partition column required rewriting all data, and concurrent writes caused data corruption. Iceberg's hidden partitioning, partition evolution, and snapshot isolation solved all three. Today, Iceberg manages millions of table snapshots across Netflix's data platform.
Apple
Apple adopted Apache Iceberg for their analytics platform, managing exabytes of data across thousands of tables. Their migration from Hive tables to Iceberg was driven by the need for ACID transactions (concurrent Spark jobs writing to the same table), schema evolution (adding columns without rewriting petabytes), and time travel for ML reproducibility. Apple is a major contributor to the Iceberg open-source project.
Uber
Uber created Apache Hudi (Hadoop Upserts Deletes and Incrementals) to handle their specific challenge: high-volume upserts on trip data. When a trip status changes (requested -> matched -> in-progress -> completed), the record must be updated in place. Hudi's Merge-on-Read and Copy-on-Write table types optimize for different upsert-to-read ratios. Uber's Hudi tables process over 5 trillion records across 10,000+ tables.
| Aspect | Description |
|---|---|
| Open Formats vs Proprietary Warehouses | Lakehouses store data in open Parquet files, avoiding vendor lock-in -- you can switch query engines without migrating data. Proprietary warehouses (Snowflake, BigQuery) offer better query performance through proprietary storage formats, automatic clustering, and deeply integrated optimization. The trade-off is flexibility and cost vs. out-of-the-box performance. |
| Copy-on-Write vs Merge-on-Read | Copy-on-Write (COW) rewrites entire data files on every update, making reads fast but writes slow and expensive. Merge-on-Read (MOR) writes updates to a delta log and merges at read time, making writes fast but reads slower (must merge base files with deltas). COW suits read-heavy analytical workloads; MOR suits write-heavy ingestion pipelines. Hudi supports both; Iceberg is converging on a similar model. |
| Single Copy vs Data Duplication | A lakehouse promises 'one copy of data serving all workloads,' eliminating ETL between lake and warehouse. In practice, performance-sensitive workloads (executive dashboards, sub-second BI) may still benefit from a materialized cache in a specialized engine (e.g., StarRocks, Pinot). The cost of a small data copy may be worth the query performance improvement. |
| Schema-on-Read Flexibility vs Data Quality | Data lakes historically used schema-on-read: store anything, figure out the schema later. This maximizes ingestion flexibility but leads to data swamps. Lakehouse table formats enforce schema-on-write (columns have types, constraints are checked) while still allowing schema evolution. Choosing the right enforcement level balances ingestion flexibility against downstream data quality. |
Expedia's Migration from Hive Data Lake to Iceberg Lakehouse
Scenario
Expedia's data platform stored 40+ PB across 100,000+ Hive tables on S3. They faced three critical problems: (1) Hive's partition listing scaled poorly -- queries on tables with 100K+ partitions took minutes just for planning. (2) Concurrent writes from multiple Spark jobs corrupted table metadata, requiring manual repair. (3) Schema changes (adding a column for a new booking attribute) required rewriting all historical data in affected partitions.
Solution
Expedia migrated their highest-value tables to Apache Iceberg. Iceberg's manifest-based metadata eliminated the S3 LIST bottleneck (O(1) partition pruning instead of O(n) directory listing). Optimistic concurrency control with snapshot isolation prevented write conflicts. Iceberg's column-ID-based schema evolution allowed adding columns without any data rewrite. They ran dual-write validation for 3 months to verify Iceberg results matched Hive.
Outcome
Query planning time for large tables dropped from minutes to seconds. Write conflicts were eliminated entirely. Schema evolution that previously required days of data rewriting became instant. Storage costs dropped 15% because Iceberg's intelligent file pruning meant queries read 60-80% less data. The migration validated Iceberg as their long-term data platform standard.
See Data Lake & Lakehouse in action
Explore system design templates that use data lake & lakehouse and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What is the primary problem that lakehouse table formats (Delta Lake, Iceberg, Hudi) solve?
2What is the 'small file problem' in data lakes, and how is it addressed?