Vetora logo
🌊Data Engineering

Stream Processing

Stream processing handles unbounded, continuously arriving data in real time or near-real time. Engines like Apache Flink, Kafka Streams, and Spark Structured Streaming process events as they arrive, enabling sub-second latency for use cases like fraud detection, real-time dashboards, and session analytics.

Overview

Stream processing treats data as an infinite, continuously arriving sequence of events rather than a bounded dataset. Instead of waiting for all data to arrive and then processing it, a stream processor ingests each event as it is produced and computes results incrementally. This enables latencies measured in milliseconds or seconds rather than hours.

The foundational challenge of stream processing is **time semantics**. Events have two timestamps: the time they occurred (**event time**) and the time they arrive at the processor (**processing time**). These can differ by seconds, minutes, or even hours due to network delays, mobile device buffering, or backfills. Event-time processing -- grouping events by when they actually happened, not when they arrived -- is essential for correct analytics but requires **watermarks**: heuristic signals that tell the engine 'all events up to time T have probably arrived.' Flink's watermark mechanism is the industry standard.

**Windowing** is how stream processors group unbounded data into finite chunks for aggregation. Tumbling windows (fixed, non-overlapping: e.g., every 5 minutes), sliding windows (overlapping: 5-minute window sliding every 1 minute), and session windows (grouped by activity gaps: a session ends after 30 minutes of inactivity) each serve different use cases. Window semantics interact with watermarks to determine when a window's result is emitted and how late data is handled.

Apache **Flink** (2015) is the most capable open-source stream processor, offering event-time semantics, exactly-once state consistency via distributed snapshots (Chandy-Lamport algorithm), and sophisticated windowing. **Kafka Streams** takes a different approach: a lightweight library embedded in your application (no separate cluster), ideal for simpler transformations and microservice-scale stream processing. **Spark Structured Streaming** bridges batch and streaming with a micro-batch execution model, sacrificing true per-event latency for Spark's rich ecosystem.

Key Points
  • 1Stream processing operates on unbounded data -- the input has no defined end. This fundamentally changes the programming model: you cannot sort the entire dataset or compute a global max. All operations must be incremental.
  • 2Event-time vs processing-time: event time is when the event occurred; processing time is when the system receives it. Correct stream analytics requires event-time processing with watermarks to handle out-of-order and late-arriving events.
  • 3Windowing converts unbounded streams into finite aggregation units. Tumbling (non-overlapping fixed windows), sliding (overlapping windows), and session (activity-gap-based) windows are the three core types. Each has different latency, completeness, and resource trade-offs.
  • 4Flink uses the Chandy-Lamport distributed snapshot algorithm for fault tolerance: periodic, consistent snapshots of all operator state are saved to durable storage. On failure, the entire pipeline restarts from the last snapshot, replays events from Kafka offsets, and recovers exactly-once state.
  • 5Kafka Streams is a library, not a framework -- it runs as part of your application JVM with no separate cluster to manage. It uses Kafka's consumer group protocol for parallelism and RocksDB for local state. This simplicity makes it ideal for microservice event processing.
  • 6Back-pressure is critical in streaming: if a downstream operator is slower than upstream, the system must slow the source rather than buffering unboundedly. Flink uses credit-based flow control; Kafka Streams relies on Kafka's consumer pull model.
Simple Example

Real-Time Fraud Detection

A payment processor uses Flink to detect fraudulent transactions in real time. Each payment event flows from Kafka into a Flink pipeline. The pipeline enriches each event with the user's recent transaction history (stored in Flink's keyed state -- the last 100 transactions per user), computes velocity features (number of transactions in the last 10 minutes using a sliding window), and applies an ML scoring model. If the fraud score exceeds a threshold, the transaction is flagged and routed to a review queue -- all within 200 milliseconds of the transaction occurring. A batch system doing this hourly would miss fraudulent bursts entirely.

Real-World Examples

Alibaba

Alibaba runs the world's largest Flink deployment, processing over 40 billion events per second during Singles' Day (11.11). Real-time stream processing powers live sales dashboards, dynamic pricing, inventory tracking, and fraud detection. Their internal Blink fork of Flink processes petabytes per hour with sub-second latency, enabling the real-time GMV ticker shown during the event.

Netflix

Netflix uses Flink for real-time data enrichment, session analytics, and operational monitoring. Their Keystone streaming platform processes over 1.5 trillion events per day from microservice logs, player events, and user interactions. Flink pipelines compute per-title viewing metrics in real time, enabling immediate detection of playback quality degradation (buffering ratio spikes) and triggering automated CDN failovers within minutes.

Uber

Uber's AthenaX platform (built on Flink) processes rider and driver events in real time for dynamic pricing (surge), ETA computation, and marketplace matching. The streaming pipeline computes supply-demand ratios per geographic hex cell every 30 seconds using tumbling windows. This feeds the surge pricing algorithm, which must react to demand changes faster than any batch system could provide.

Trade-Offs
AspectDescription
Latency vs CompletenessStream processing delivers results in milliseconds but must cope with incomplete data. Late-arriving events may arrive after a window has closed. The watermark mechanism trades off latency (how long to wait) against completeness (how much late data to include). Aggressive watermarks give fast results but miss more late data; conservative watermarks are more complete but add latency.
Operational Complexity vs Real-Time ValueStream processing requires managing stateful operators, checkpoints, watermarks, and back-pressure -- significantly more complex than batch. You need monitoring for consumer lag, checkpoint duration, and state size. This complexity is justified only when the business value of real-time data (fraud prevention, live dashboards, instant personalization) exceeds the operational cost.
State Management vs ScalabilityStateful stream processing (e.g., maintaining per-user session state) is powerful but challenging. State grows with the number of keys and the retention window. Flink stores state in RocksDB with periodic checkpoints to S3; Kafka Streams uses compacted changelog topics. Large state (hundreds of GB per operator) increases checkpoint time and recovery time, limiting scalability.
Exactly-Once vs PerformanceExactly-once semantics in streaming requires coordinating source offsets, operator state, and sink writes in a single atomic unit. Flink's two-phase commit sink and Kafka's transactional producer achieve this but add latency (commit intervals of 100ms-1s) and reduce throughput by 10-30%. At-least-once processing is simpler and faster but requires idempotent sinks.
Case Study

LinkedIn's Unified Streaming Platform with Samza and Kafka

Scenario

LinkedIn needed to process billions of events per day in real time for feed ranking, notification delivery, ad targeting, and anti-abuse detection. Their initial approach used custom consumer applications reading from Kafka, but these lacked proper state management, fault tolerance, and windowing -- each team reinvented these primitives differently, leading to correctness bugs and operational burden.

Solution

LinkedIn developed Apache Samza, a stream processing framework tightly integrated with Kafka. Samza uses Kafka topics for both input and changelog-based state recovery. Each Samza job partitions state by key (aligned with Kafka partition assignment), stores state locally in RocksDB, and replicates state changes to a compacted Kafka changelog topic. On failure, a new container replays the changelog to restore state.

Outcome

LinkedIn's streaming platform processes over 5 trillion messages per day across thousands of Samza jobs. Feed relevance improved because ranking signals (who viewed what, engagement patterns) update in seconds rather than hours. Notification delivery latency dropped from minutes to under 5 seconds. The standardized framework eliminated correctness bugs from ad-hoc state management.

Common Mistakes
  • Using processing time instead of event time for analytics aggregations. Processing-time windows mix events from different actual time periods when there is lag or replay, producing incorrect counts. Always use event-time semantics with watermarks for any aggregation that needs to reflect when events actually occurred.
  • Setting watermarks too aggressively (low allowed lateness) and silently dropping important late-arriving events. Monitor your late-event discard rate and configure side outputs to capture late data for later reprocessing rather than dropping it.
  • Storing unbounded state without TTL (time-to-live) expiration. For example, keeping per-user state forever in a Flink keyed state grows the state indefinitely as new users appear. Configure state TTL to evict entries for inactive keys, and monitor total state size per operator.
  • Choosing Spark Structured Streaming for true real-time (sub-100ms) use cases. Structured Streaming uses a micro-batch model with a minimum trigger interval, adding inherent latency. For sub-second processing, use Flink or Kafka Streams.
Related Concepts

See Stream Processing in action

Explore system design templates that use stream processing and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Process click streams in real-time with windowed aggregation

Metrics to watch
processing_latency_msthroughput_events_per_secwatermark_lag_mscheckpoint_duration_ms
Run Simulation
Test Your Understanding

1What problem do watermarks solve in stream processing?

2Why might you choose Apache Flink over Spark Structured Streaming for a fraud detection pipeline requiring sub-100ms latency?

Deeper Reading