Vetora logo
🌊Data Engineering

Data Lake & Lakehouse

A data lake stores raw, unstructured, and structured data at any scale in open file formats on cheap object storage. The lakehouse architecture adds ACID transactions, schema enforcement, and time travel on top of data lakes using table formats like Delta Lake, Apache Iceberg, and Apache Hudi, bridging the gap between data lakes and data warehouses.

Overview

The **data lake** emerged as a reaction to the rigid schemas and high costs of traditional data warehouses. Instead of requiring data to conform to a predefined schema before loading (schema-on-write), a data lake stores raw data in its native format -- CSV, JSON, Parquet, images, logs -- on cheap object storage like S3, GCS, or ADLS. Any team can write any data to the lake without upfront schema design. Analytics engines (Spark, Presto, Athena) apply schema at read time (schema-on-read), enabling flexible exploration.

However, data lakes quickly became 'data swamps.' Without transactions, concurrent writes could corrupt data. Without schema enforcement, columns drifted across files. Without versioning, mistakes were irreversible. Without indexing, queries scanned entire directories. The promise of a universal data store collided with the reality of unreliable data.

The **lakehouse** architecture solves these problems by adding a transactional metadata layer on top of the data lake. Three open-source table formats compete: **Delta Lake** (Databricks, uses a JSON/Parquet transaction log), **Apache Iceberg** (Netflix, uses manifest files and snapshot metadata), and **Apache Hudi** (Uber, optimized for upserts and incremental processing). All three provide ACID transactions, schema evolution, partition evolution, time travel (query data as of a past timestamp), and efficient upsert/delete operations -- all while storing data in open Parquet files on object storage.

The lakehouse eliminates the need for a separate data warehouse for many workloads. A single copy of data on S3, managed by Iceberg, can serve batch ETL (Spark), interactive SQL queries (Trino/StarRocks), ML training (Ray/PyTorch reading Parquet directly), and streaming ingestion (Flink writing Iceberg commits). This 'single copy, multiple engines' approach reduces data duplication, storage costs, and ETL complexity.

Key Points
  • 1Data lakes store raw data in open formats (Parquet, ORC, Avro, JSON) on object storage (S3, GCS, ADLS). Storage costs are 10-100x cheaper than data warehouse storage ($0.023/GB/month for S3 vs $0.02-0.04/GB/month for warehouse compute-coupled storage).
  • 2Lakehouse table formats (Delta Lake, Iceberg, Hudi) add ACID transactions to data lakes. They maintain a metadata layer that tracks which files belong to each table version, enabling atomic commits, snapshot isolation, and rollback.
  • 3Apache Iceberg is emerging as the industry standard: it supports hidden partitioning (users write SQL without knowing the partition scheme), partition evolution (change partitioning without rewriting data), and works with multiple engines (Spark, Flink, Trino, Dremio, StarRocks).
  • 4Time travel allows querying data as of any past snapshot. This enables reproducible ML training (train on the exact dataset from last Tuesday), debugging (compare today's data to yesterday's), and regulatory compliance (audit historical state).
  • 5Schema evolution in lakehouse formats is non-destructive: adding, renaming, or reordering columns does not require rewriting existing data files. Iceberg uses unique column IDs rather than column positions, making evolution safe even when column order changes.
  • 6The small file problem is the primary performance pitfall in data lakes. Thousands of tiny files (e.g., from frequent streaming writes) cause excessive metadata overhead and slow query planning. Compaction jobs merge small files into larger ones (target: 128MB-1GB per file).
Simple Example

Lakehouse for a Ride-Sharing Platform

A ride-sharing company stores all trip data in Apache Iceberg tables on S3. Raw trip events are streamed from Kafka into an Iceberg table via Flink, committed every minute. The table is partitioned by date and city. The analytics team queries trip data using Trino for ad-hoc SQL analysis. The ML team reads the same Iceberg table directly from PyTorch for demand forecasting model training. The finance team uses Spark batch jobs to compute driver payments. When the ML team realizes they need data from last month for a retraining experiment, they use Iceberg's time travel: SELECT * FROM trips FOR TIMESTAMP AS OF '2026-05-01'. One copy of data, three engines, full ACID guarantees.

Real-World Examples

Netflix

Netflix created Apache Iceberg to manage their 100+ PB data lake on S3. Their previous Hive-based approach had severe limitations: partition pruning required users to know the physical layout, adding a partition column required rewriting all data, and concurrent writes caused data corruption. Iceberg's hidden partitioning, partition evolution, and snapshot isolation solved all three. Today, Iceberg manages millions of table snapshots across Netflix's data platform.

Apple

Apple adopted Apache Iceberg for their analytics platform, managing exabytes of data across thousands of tables. Their migration from Hive tables to Iceberg was driven by the need for ACID transactions (concurrent Spark jobs writing to the same table), schema evolution (adding columns without rewriting petabytes), and time travel for ML reproducibility. Apple is a major contributor to the Iceberg open-source project.

Uber

Uber created Apache Hudi (Hadoop Upserts Deletes and Incrementals) to handle their specific challenge: high-volume upserts on trip data. When a trip status changes (requested -> matched -> in-progress -> completed), the record must be updated in place. Hudi's Merge-on-Read and Copy-on-Write table types optimize for different upsert-to-read ratios. Uber's Hudi tables process over 5 trillion records across 10,000+ tables.

Trade-Offs
AspectDescription
Open Formats vs Proprietary WarehousesLakehouses store data in open Parquet files, avoiding vendor lock-in -- you can switch query engines without migrating data. Proprietary warehouses (Snowflake, BigQuery) offer better query performance through proprietary storage formats, automatic clustering, and deeply integrated optimization. The trade-off is flexibility and cost vs. out-of-the-box performance.
Copy-on-Write vs Merge-on-ReadCopy-on-Write (COW) rewrites entire data files on every update, making reads fast but writes slow and expensive. Merge-on-Read (MOR) writes updates to a delta log and merges at read time, making writes fast but reads slower (must merge base files with deltas). COW suits read-heavy analytical workloads; MOR suits write-heavy ingestion pipelines. Hudi supports both; Iceberg is converging on a similar model.
Single Copy vs Data DuplicationA lakehouse promises 'one copy of data serving all workloads,' eliminating ETL between lake and warehouse. In practice, performance-sensitive workloads (executive dashboards, sub-second BI) may still benefit from a materialized cache in a specialized engine (e.g., StarRocks, Pinot). The cost of a small data copy may be worth the query performance improvement.
Schema-on-Read Flexibility vs Data QualityData lakes historically used schema-on-read: store anything, figure out the schema later. This maximizes ingestion flexibility but leads to data swamps. Lakehouse table formats enforce schema-on-write (columns have types, constraints are checked) while still allowing schema evolution. Choosing the right enforcement level balances ingestion flexibility against downstream data quality.
Case Study

Expedia's Migration from Hive Data Lake to Iceberg Lakehouse

Scenario

Expedia's data platform stored 40+ PB across 100,000+ Hive tables on S3. They faced three critical problems: (1) Hive's partition listing scaled poorly -- queries on tables with 100K+ partitions took minutes just for planning. (2) Concurrent writes from multiple Spark jobs corrupted table metadata, requiring manual repair. (3) Schema changes (adding a column for a new booking attribute) required rewriting all historical data in affected partitions.

Solution

Expedia migrated their highest-value tables to Apache Iceberg. Iceberg's manifest-based metadata eliminated the S3 LIST bottleneck (O(1) partition pruning instead of O(n) directory listing). Optimistic concurrency control with snapshot isolation prevented write conflicts. Iceberg's column-ID-based schema evolution allowed adding columns without any data rewrite. They ran dual-write validation for 3 months to verify Iceberg results matched Hive.

Outcome

Query planning time for large tables dropped from minutes to seconds. Write conflicts were eliminated entirely. Schema evolution that previously required days of data rewriting became instant. Storage costs dropped 15% because Iceberg's intelligent file pruning meant queries read 60-80% less data. The migration validated Iceberg as their long-term data platform standard.

Common Mistakes
  • Treating a data lake as a 'dump everything and figure it out later' zone. Without catalog metadata, ownership, and access controls, data lakes become data swamps. Implement a data catalog (Unity Catalog, AWS Glue Catalog, Polaris) from day one with clear ownership and discoverability.
  • Ignoring the small file problem. Streaming ingestion that commits every few seconds creates thousands of tiny Parquet files, each a few KB. Query engines waste most of their time on metadata overhead rather than data scanning. Run regular compaction jobs to merge small files into 128MB-1GB target sizes.
  • Choosing a table format based on hype rather than requirements. Delta Lake is tightly integrated with Databricks/Spark. Iceberg has the broadest multi-engine support (Spark, Flink, Trino, Dremio). Hudi excels at record-level upserts. Evaluate based on your engine ecosystem and write patterns.
  • Not configuring snapshot expiration and orphan file cleanup. Lakehouse table formats retain every historical snapshot by default, causing metadata bloat and storage waste. Configure snapshot expiration (e.g., retain 7 days) and run orphan file cleanup regularly.
Related Concepts

See Data Lake & Lakehouse in action

Explore system design templates that use data lake & lakehouse and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Query a lakehouse with ACID tables over object storage

Metrics to watch
query_latency_mscompaction_duration_msstorage_cost_per_tbtime_travel_overhead_ms
Run Simulation
Test Your Understanding

1What is the primary problem that lakehouse table formats (Delta Lake, Iceberg, Hudi) solve?

2What is the 'small file problem' in data lakes, and how is it addressed?

Deeper Reading