Data Engineering

Batch processing, stream processing, ETL pipelines, and data lakes.

Concepts

Batch Processing

Batch processing operates on bounded, finite datasets by collecting data over a period and processing it as a single unit. Frameworks like MapReduce, Spark, and Hive enable parallel computation across commodity clusters, trading latency for throughput. Batch remains the backbone of data warehousing, ML training pipelines, and large-scale analytics.
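
As a rough illustration of the map-then-reduce shape behind these frameworks, the sketch below counts words over a small bounded dataset. It is a single-process toy, not how Spark or MapReduce are implemented; the names map_chunk and run_batch and the sample data are made up for this example.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map phase: count words within one chunk of the bounded input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines, chunk_size=2):
    """Split the full dataset into chunks, map each chunk, then merge the partial results."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # On a cluster each chunk would be mapped on a different worker;
    # here the chunks are processed one after another.
    partials = [map_chunk(chunk) for chunk in chunks]
    # Reduce phase: merge the per-chunk counters into one result.
    return reduce(lambda a, b: a + b, partials, Counter())


# The whole dataset is known up front, which is what makes this a batch job.
dataset = ["spark hive spark", "hive mapreduce", "spark"]
word_counts = run_batch(dataset)
```

Nothing is emitted until the entire input has been read, which is the latency-for-throughput trade described above.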

Stream Processing (P0)

Stream processing handles unbounded, continuously arriving data in real time or near-real time. Engines like Apache Flink, Kafka Streams, and Spark Structured Streaming process events as they arrive, enabling sub-second latency for use cases like fraud detection, real-time dashboards, and session analytics.
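
One common streaming pattern is a tumbling window: events are grouped into fixed, non-overlapping time buckets and a result is emitted as each bucket closes. The sketch below is a simplified single-process version that assumes events arrive in timestamp order (real engines handle out-of-order events with watermarks). The function name and the (timestamp, key) event shape are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=10):
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive in non-decreasing timestamp order.
    """
    current_window = None
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        if current_window is not None and window_start != current_window:
            # A later window has started, so the previous one is complete.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        # Flush the last window. This only happens because the demo input is
        # finite; a truly unbounded stream never reaches this point.
        yield current_window, dict(counts)


# Hypothetical events: (timestamp in seconds, event type).
events = [(1, "login"), (4, "click"), (9, "click"), (12, "login"), (25, "click")]
windows = list(tumbling_window_counts(iter(events)))
```

Because the function is a generator, each window's result is available as soon as the first event of a later window arrives, rather than after the whole input has been consumed.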

ETL Pipelines (P0)

ETL (Extract, Transform, Load) pipelines move data from source systems to analytical destinations by extracting raw data, transforming it into a usable schema, and loading it into a target store. Modern variants include ELT (load raw, then transform in-warehouse) and reverse ETL (push warehouse data back to operational tools).
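
The three stages can be sketched with only the Python standard library. The CSV content, column names, and the orders table below are invented for this example, and an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system; row 2 has a missing amount.
RAW_CSV = """order_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""


def extract(text):
    """Extract: parse raw CSV text into dictionaries without interpreting it."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, convert types, normalize values."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with no amount
        cleaned.append((
            int(row["order_id"]),
            round(float(row["amount"]) * 100),  # store money as integer cents
            row["currency"].upper(),
        ))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    # INSERT OR REPLACE keeps the load safe to re-run for the same order_id.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, amount_cents, currency FROM orders ORDER BY order_id"
).fetchall()
```

In an ELT variant, load would write the raw rows first and the cleanup in transform would instead run as SQL inside the warehouse.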

Data Lake & Lakehouse (P1)

A data lake stores raw data of any shape (structured, semi-structured, or unstructured) at large scale in open file formats such as Parquet or ORC on low-cost object storage. The lakehouse architecture layers ACID transactions, schema enforcement, and time travel on top of a data lake using table formats like Delta Lake, Apache Iceberg, and Apache Hudi, bridging the gap between data lakes and data warehouses.
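
The core idea behind these table formats, immutable data files plus a log of commits that says which files are visible at each version, can be sketched in a few lines. The ToyTable class below is an in-memory illustration only; it is not the API or on-disk layout of Delta Lake, Iceberg, or Hudi, and all names in it are hypothetical.

```python
class ToyTable:
    """In-memory stand-in for a table format: immutable files plus a commit log."""

    def __init__(self, schema):
        self.schema = schema   # column name -> expected Python type
        self.files = {}        # file name -> list of rows (never modified once written)
        self.log = [[]]        # log[v] = file names visible at version v

    def append(self, rows):
        # Schema enforcement: reject the whole commit if any row is malformed.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"{col} must be {typ.__name__}")
        name = f"part-{len(self.files):05d}"
        self.files[name] = list(rows)
        # The commit is a single new log entry, so readers see all of it or none of it.
        self.log.append(self.log[-1] + [name])
        return len(self.log) - 1

    def read(self, version=None):
        # Time travel: pick an older log entry instead of the latest one.
        visible = self.log[-1 if version is None else version]
        return [row for name in visible for row in self.files[name]]


table = ToyTable({"id": int, "name": str})
v1 = table.append([{"id": 1, "name": "a"}])
v2 = table.append([{"id": 2, "name": "b"}])

try:
    table.append([{"id": "3", "name": "c"}])  # wrong type for id
    rejected = False
except ValueError:
    rejected = True
```

Reading with table.read(version=v1) returns the table as it was after the first commit, because older log entries and the files they point to are never rewritten.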

Change Data Capture (P1)

Change Data Capture (CDC) tracks row-level changes (inserts, updates, deletes) in a source database and propagates them downstream in near-real time. Log-based CDC reads the database's transaction log (for example, the PostgreSQL WAL or the MySQL binlog) to capture every committed change with little added load on the source, since it avoids polling queries and triggers. This enables real-time data replication, event-driven architectures, and streaming ETL.
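
On the consuming side, a CDC stream is typically replayed in log order against a downstream copy. The sketch below assumes a simplified, hypothetical event shape (lsn, op, key, after); real connectors emit richer envelopes, and the field names here are not taken from any particular tool.

```python
def apply_change(replica, event):
    """Apply one row-level change event to an in-memory replica keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["after"]  # full row image after the change
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical change events in transaction-log order.
change_log = [
    {"lsn": 1, "op": "insert", "key": 1, "after": {"name": "Ada", "plan": "free"}},
    {"lsn": 2, "op": "insert", "key": 2, "after": {"name": "Bob", "plan": "free"}},
    {"lsn": 3, "op": "update", "key": 1, "after": {"name": "Ada", "plan": "pro"}},
    {"lsn": 4, "op": "delete", "key": 2},
]

replica = {}
last_applied_lsn = 0
for event in change_log:
    if event["lsn"] <= last_applied_lsn:
        continue  # already applied; makes redelivery after a restart harmless
    apply_change(replica, event)
    last_applied_lsn = event["lsn"]
```

Tracking the last applied log position is what lets the consumer resume after a restart without re-reading the source tables.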

Data Partitioning & Shuffles (P1)

Data partitioning determines how records are distributed across the nodes and files of a distributed system. A shuffle is the redistribution of data between partitions, usually over the network, during operations like joins and aggregations, and it is often the most expensive stage of a distributed job. Understanding partitioning strategies and shuffle mechanics is essential for optimizing distributed query performance.
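
A minimal sketch of hash partitioning and the shuffle it implies for a group-by aggregation follows. It runs in one process and only counts the records that would cross partitions; partition_for, shuffle, and the sample records are made up for this example. zlib.crc32 is used instead of Python's built-in hash() because the latter is randomized per process for strings.

```python
import zlib


def partition_for(key, num_partitions):
    """Stable hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(partitions, key_fn, num_partitions):
    """Redistribute records so that equal keys land in the same output partition."""
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src_idx, records in enumerate(partitions):
        for rec in records:
            dst = partition_for(key_fn(rec), num_partitions)
            out[dst].append(rec)
            if dst != src_idx:
                moved += 1  # on a cluster this record would travel over the network
    return out, moved


# Input is partitioned arbitrarily (for example by arrival order), not by user.
input_partitions = [
    [("alice", 3), ("bob", 5)],
    [("alice", 2), ("carol", 1)],
    [("bob", 4), ("alice", 1)],
]

shuffled, moved = shuffle(input_partitions, key_fn=lambda rec: rec[0], num_partitions=3)

# After the shuffle, each partition can aggregate its own keys independently.
totals = {}
for part in shuffled:
    for user, amount in part:
        totals[user] = totals.get(user, 0) + amount
```

If the input had already been partitioned by user with the same partitioner, moved would be zero, which is why matching the partitioning to the join or group-by key can avoid a shuffle.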

Exactly-Once Semantics (P1)

Exactly-once semantics guarantees that each record's effect on pipeline state and output is applied exactly one time, so the result is the same as if no failures occurred, even though a record may be physically reprocessed during recovery. Achieving this in distributed systems requires coordinating source offsets, processing state, and sink writes atomically, a challenge addressed by techniques like idempotent producers, transactional sinks, and distributed snapshots.