What problem do watermarks solve in stream processing?
Stream processing treats data as an unbounded, continuously arriving sequence of events rather than a bounded dataset. Instead of waiting for all data to arrive and then processing it, a stream processor such as Apache Flink, Kafka Streams, or Spark Structured Streaming ingests each event as it is produced and computes results incrementally. This brings latency down from hours to seconds or milliseconds, which suits use cases like fraud detection, real-time dashboards, and session analytics.
The foundational challenge of stream processing is **time semantics**. Every event has two timestamps: when it occurred (**event time**) and when it reached the processor (**processing time**). The two can differ by seconds, minutes, or even hours because of network delays, mobile device buffering, or backfills. Event-time processing, which groups events by when they actually happened rather than when they arrived, is essential for correct analytics, but it requires **watermarks**: heuristic signals that tell the engine 'all events up to time T have probably arrived.' Flink's watermark mechanism is one of the most widely used implementations.
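One common heuristic for producing watermarks is to trail the highest event time seen so far by a fixed tolerance. The sketch below illustrates that idea in plain Python; the class name, the `max_delay_ms` parameter, and the lateness rule are assumptions made for illustration, not Flink's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedDelayWatermark:
    """Watermark = highest event time seen so far minus a fixed tolerance."""

    max_delay_ms: int                         # assumed out-of-order tolerance
    max_event_time_ms: Optional[int] = None   # highest event time observed

    def observe(self, event_time_ms: int) -> None:
        # Only a newer event time can advance the watermark.
        if self.max_event_time_ms is None or event_time_ms > self.max_event_time_ms:
            self.max_event_time_ms = event_time_ms

    def current_watermark(self) -> Optional[int]:
        # No watermark exists until at least one event has been seen.
        if self.max_event_time_ms is None:
            return None
        return self.max_event_time_ms - self.max_delay_ms

    def is_late(self, event_time_ms: int) -> bool:
        # An event is late if the watermark has already passed its timestamp.
        watermark = self.current_watermark()
        return watermark is not None and event_time_ms <= watermark
```

With a 5-second tolerance, seeing an event stamped 10,000 ms puts the watermark at 5,000 ms: an event stamped 4,000 ms that shows up afterwards counts as late, while one stamped 6,000 ms is still on time. A larger tolerance admits more stragglers at the cost of holding results back longer.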
**Windowing** is how stream processors group unbounded data into finite chunks for aggregation. Tumbling windows (fixed, non-overlapping: e.g., every 5 minutes), sliding windows (overlapping: 5-minute window sliding every 1 minute), and session windows (grouped by activity gaps: a session ends after 30 minutes of inactivity) each serve different use cases. Window semantics interact with watermarks to determine when a window's result is emitted and how late data is handled.
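The three window types differ mainly in how a timestamp is mapped to one or more windows. The following is a minimal sketch of that mapping, assuming integer timestamps in minutes and half-open `[start, end)` windows; the function names are hypothetical, and real engines add triggers, state, and late-data handling on top.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """The single fixed, non-overlapping window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """Every window of length `size`, starting on a multiple of `slide`, that contains ts."""
    windows = []
    start = ts - ts % slide          # latest window start at or before ts
    while start > ts - size:         # earlier starts still cover ts
        windows.append((start, start + size))
        start -= slide
    return windows[::-1]             # oldest window first


def session_windows(timestamps: Iterable[int], gap: int) -> List[Tuple[int, int]]:
    """Group events into sessions, returned as (first_event, last_event) pairs.

    A new session starts when an event arrives `gap` or more after the previous one.
    """
    sessions: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1] = (sessions[-1][0], ts)   # extend the open session
        else:
            sessions.append((ts, ts))              # start a new session
    return sessions
```

For example, an event at minute 7 falls into the tumbling window `(5, 10)` when the size is 5, but into five sliding windows when a 5-minute window slides every minute. With a 30-minute gap, events at minutes 0, 10, 35, 70, and 120 form three sessions.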
Apache **Flink** (2015) is among the most capable open-source stream processors, offering event-time semantics, exactly-once state consistency via distributed snapshots (a variant of the Chandy-Lamport algorithm), and sophisticated windowing. **Kafka Streams** takes a different approach: a lightweight library embedded in your application (no separate cluster), ideal for simpler transformations and microservice-scale stream processing. **Spark Structured Streaming** bridges batch and streaming with a micro-batch execution model, sacrificing true per-event latency for Spark's rich ecosystem.
Real-Time Fraud Detection
A payment processor uses Flink to detect fraudulent transactions in real time. Each payment event flows from Kafka into a Flink pipeline. The pipeline enriches each event with the user's recent transaction history (stored in Flink's keyed state -- the last 100 transactions per user), computes velocity features (number of transactions in the last 10 minutes using a sliding window), and applies an ML scoring model. If the fraud score exceeds a threshold, the transaction is flagged and routed to a review queue -- all within 200 milliseconds of the transaction occurring. A batch system doing this hourly would miss fraudulent bursts entirely.
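The keyed-state and velocity steps of this pipeline can be sketched without any streaming framework. The example below is an illustration only: `VelocityTracker`, `is_suspicious`, and the threshold of 5 are hypothetical stand-ins (the scenario's ML scoring model is not described), and events are assumed to arrive in time order per user.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

MAX_HISTORY = 100          # last 100 transactions per user, as in the scenario
VELOCITY_WINDOW_S = 600    # 10-minute lookback for the velocity feature


class VelocityTracker:
    """Per-user transaction history, standing in for an engine's keyed state."""

    def __init__(self) -> None:
        self._history: Dict[str, Deque[Tuple[int, float]]] = defaultdict(
            lambda: deque(maxlen=MAX_HISTORY)
        )

    def record(self, user_id: str, ts_s: int, amount: float) -> int:
        """Store the transaction and return the user's count in the last 10 minutes."""
        history = self._history[user_id]
        history.append((ts_s, amount))     # oldest entry is evicted past 100
        cutoff = ts_s - VELOCITY_WINDOW_S
        return sum(1 for t, _ in history if t > cutoff)


def is_suspicious(velocity: int, threshold: int = 5) -> bool:
    """Hypothetical rule standing in for the ML scoring model."""
    return velocity >= threshold
```

Because the history is capped at 100 entries per user, the velocity feature can never exceed 100. A production pipeline would size the state to the largest burst it needs to detect.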
Alibaba
Alibaba runs one of the world's largest Flink deployments, with a reported peak of about 4 billion events per second during Singles' Day (11.11) in 2020. Real-time stream processing powers live sales dashboards, dynamic pricing, inventory tracking, and fraud detection. Their internal Blink fork of Flink processes petabytes per hour with sub-second latency, enabling the real-time GMV ticker shown during the event.
Netflix
Netflix uses Flink for real-time data enrichment, session analytics, and operational monitoring. Their Keystone streaming platform processes over 1.5 trillion events per day from microservice logs, player events, and user interactions. Flink pipelines compute per-title viewing metrics in real time, enabling immediate detection of playback quality degradation (buffering ratio spikes) and triggering automated CDN failovers within minutes.
Uber
Uber's AthenaX platform (built on Flink) processes rider and driver events in real time for dynamic pricing (surge), ETA computation, and marketplace matching. The streaming pipeline computes supply-demand ratios per geographic hex cell every 30 seconds using tumbling windows. This feeds the surge pricing algorithm, which must react to demand changes faster than any batch system could provide.
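A 30-second tumbling-window aggregation per hex cell might look like the sketch below. The event shape `(ts_s, cell_id, kind)` and the function name are assumptions for illustration; the actual pipeline's schema is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_S = 30  # tumbling window length in seconds

Event = Tuple[int, str, str]   # (timestamp in seconds, hex cell id, "driver" or "rider")


def supply_demand_ratios(events: Iterable[Event]) -> Dict[Tuple[int, str], float]:
    """Return drivers per rider for each (window_start, cell_id) pair."""
    counts: Dict[Tuple[int, str], Dict[str, int]] = defaultdict(
        lambda: {"driver": 0, "rider": 0}
    )
    for ts_s, cell_id, kind in events:
        window_start = ts_s - ts_s % WINDOW_S      # assign to a tumbling window
        counts[(window_start, cell_id)][kind] += 1

    ratios = {}
    for key, c in counts.items():
        # No riders in the window means supply is unconstrained.
        ratios[key] = c["driver"] / c["rider"] if c["rider"] else float("inf")
    return ratios
```

A ratio below 1 for a cell means more riders than drivers in that window, which is the kind of signal a surge algorithm would consume.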
| Aspect | Description |
|---|---|
| Latency vs Completeness | Stream processing delivers results in milliseconds but must cope with incomplete data: late events may arrive after their window has closed. The watermark mechanism trades off latency (how long to wait) against completeness (how much late data to include). Aggressive watermarks give fast results but miss more late data; conservative watermarks are more complete but add latency. |
| Operational Complexity vs Real-Time Value | Stream processing requires managing stateful operators, checkpoints, watermarks, and back-pressure -- significantly more complex than batch. You need monitoring for consumer lag, checkpoint duration, and state size. This complexity is justified only when the business value of real-time data (fraud prevention, live dashboards, instant personalization) exceeds the operational cost. |
| State Management vs Scalability | Stateful stream processing (e.g., maintaining per-user session state) is powerful but challenging. State grows with the number of keys and the retention window. Flink can keep state in RocksDB with periodic checkpoints to durable storage such as S3; Kafka Streams backs its state stores with compacted changelog topics. Large state (hundreds of GB per operator) increases checkpoint time and recovery time, limiting scalability. |
| Exactly-Once vs Performance | Exactly-once semantics in streaming requires coordinating source offsets, operator state, and sink writes in a single atomic unit. Flink's two-phase commit sink and Kafka's transactional producer achieve this but add latency (commit intervals of 100ms-1s) and reduce throughput by 10-30%. At-least-once processing is simpler and faster but requires idempotent sinks. |
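
The last row notes that at-least-once delivery needs idempotent sinks. A minimal sketch of one way to get idempotence is to key every write by a unique event ID, so a redelivered event becomes a no-op. The `IdempotentSink` class and the in-memory dict are assumptions standing in for a real keyed store.

```python
from typing import Dict


class IdempotentSink:
    """Writes keyed by event ID; a redelivered event does not change the result."""

    def __init__(self) -> None:
        self.rows: Dict[str, float] = {}   # stand-in for a keyed external store

    def write(self, event_id: str, value: float) -> bool:
        """Apply the write once. Returns False if the event was already applied."""
        if event_id in self.rows:
            return False                   # duplicate from a replay: ignore
        self.rows[event_id] = value
        return True

    def total(self) -> float:
        return sum(self.rows.values())
```

If the processor crashes after writing `e1` and replays it on recovery, the second write is ignored and the total stays correct. A sink that blindly appended each delivery would double-count instead.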
LinkedIn's Unified Streaming Platform with Samza and Kafka
Scenario
LinkedIn needed to process billions of events per day in real time for feed ranking, notification delivery, ad targeting, and anti-abuse detection. Their initial approach used custom consumer applications reading from Kafka, but these lacked proper state management, fault tolerance, and windowing -- each team reinvented these primitives differently, leading to correctness bugs and operational burden.
Solution
LinkedIn developed Apache Samza, a stream processing framework tightly integrated with Kafka. Samza uses Kafka topics for both input and changelog-based state recovery. Each Samza job partitions state by key (aligned with Kafka partition assignment), stores state locally in RocksDB, and replicates state changes to a compacted Kafka changelog topic. On failure, a new container replays the changelog to restore state.
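The changelog-based recovery described above can be illustrated in a few lines. This is a sketch under stated assumptions: a Python list stands in for the compacted Kafka changelog topic, a dict stands in for the local RocksDB store, and the class and function names are hypothetical rather than Samza's API.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[int]]   # (key, value); value None marks a delete


class ChangelogBackedStore:
    """Local key-value state whose every change is also appended to a changelog."""

    def __init__(self, changelog: List[Record]) -> None:
        self.changelog = changelog       # stand-in for a Kafka changelog topic
        self.local: Dict[str, int] = {}  # stand-in for the local RocksDB store

    def put(self, key: str, value: int) -> None:
        self.local[key] = value
        self.changelog.append((key, value))

    def delete(self, key: str) -> None:
        self.local.pop(key, None)
        self.changelog.append((key, None))   # tombstone record

    @classmethod
    def restore(cls, changelog: List[Record]) -> "ChangelogBackedStore":
        """Rebuild local state in a new container by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.local.pop(key, None)
            else:
                store.local[key] = value
        return store


def compact(changelog: List[Record]) -> List[Record]:
    """Keep only the latest record per key, as log compaction does."""
    latest: Dict[str, Optional[int]] = {}
    for key, value in changelog:
        latest.pop(key, None)    # re-insert so the key takes its newest position
        latest[key] = value
    return list(latest.items())
```

Replaying either the full changelog or its compacted form yields the same local state, which is why compaction can bound recovery time without losing correctness.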
Outcome
LinkedIn's streaming platform processes over 5 trillion messages per day across thousands of Samza jobs. Feed relevance improved because ranking signals (who viewed what, engagement patterns) update in seconds rather than hours. Notification delivery latency dropped from minutes to under 5 seconds. The standardized framework eliminated correctness bugs from ad-hoc state management.