Vetora logo
⏱️Messaging & Streaming

Windowing in Stream Processing

Windowing divides an unbounded event stream into finite, time-bounded segments (windows) for aggregation and analysis. The three fundamental window types are tumbling (fixed, non-overlapping), hopping (fixed, overlapping), and session (dynamic, activity-based). Windowing is essential for computing time-based metrics like 'clicks per minute,' 'average latency over 5 minutes,' or 'purchases per user session.'

Overview

Batch processing operates on bounded datasets: 'process all of yesterday's logs.' Stream processing operates on unbounded data: events arrive continuously, forever. To compute aggregations (counts, averages, sums) on unbounded data, you need to define finite segments -- **windows** -- over which to compute.

The three fundamental window types are:

**Tumbling windows** divide time into fixed, non-overlapping intervals. A 1-minute tumbling window creates buckets [00:00-01:00), [01:00-02:00), [02:00-03:00), etc. Each event belongs to exactly one window. This is the simplest window type, used for metrics like 'requests per minute' or 'errors per hour.'

**Hopping windows** (also called sliding windows) have a fixed size and a fixed advance interval (hop). A 5-minute window with a 1-minute hop creates overlapping windows: [00:00-05:00), [01:00-06:00), [02:00-07:00), etc. Each event belongs to multiple windows (size/hop = 5 windows per event in this case). This smooths out metric spikes: instead of a sharp drop at window boundaries, the metric updates every hop interval.

**Session windows** are dynamic: they start with the first event and close after an inactivity gap (e.g., 30 minutes with no events). A user who clicks at 10:00, 10:05, 10:20, and 11:00 creates two sessions: [10:00-10:20] and [11:00-11:00]. Session windows capture user behavior patterns: page sessions, shopping sessions, or chat conversations.

A critical distinction is **event time** vs **processing time**. Event time is when the event occurred (embedded timestamp). Processing time is when the system processes the event. Due to network delays, batching, and retransmission, events may arrive out of order. A click at 10:00:05 may arrive at 10:00:12. If you use processing time, the click falls in the wrong window. Event-time windowing uses the embedded timestamp to place events in the correct window.

**Watermarks** solve the problem of knowing when a window is complete. A watermark is a declaration: 'all events with timestamp ≤ T have arrived.' When the watermark passes a window's end time, the window is closed and its result is emitted. Late events (arriving after the watermark) can be handled by allowed lateness: the window stays open for a grace period to accommodate stragglers.

Key Points
  • 1Tumbling windows: fixed-size, non-overlapping. Each event belongs to exactly one window. Simplest, used for basic time-bucketed metrics.
  • 2Hopping windows: fixed-size, overlapping. Each event belongs to size/hop windows. Smoother metrics, but higher computational cost.
  • 3Session windows: dynamic, bounded by inactivity gaps. Capture natural user sessions. Variable-length windows make them harder to implement and predict.
  • 4Event time vs processing time: event time uses the timestamp embedded in the event (when it happened). Processing time uses the clock when the event is processed (when it arrived). Event time is correct but requires watermarks; processing time is simple but inaccurate with out-of-order data.
  • 5Watermarks track event-time progress. When the watermark passes a window boundary, the window closes. Heuristic watermarks (based on observed event times) may be too early (closing windows before late events arrive) or too late (delaying results).
  • 6Late events (arriving after the watermark) can be handled by: (1) dropping them, (2) allowed lateness with window refire (update the result when late data arrives), or (3) side output to a separate channel for manual handling.
Simple Example

Click-Through Rate by Minute

An ad platform computes click-through rate (CTR) per ad per minute. Using a 1-minute tumbling window keyed by ad_id: each window collects all click events for a specific ad within that minute, counts them, divides by impressions (from another stream), and emits the CTR. At 10:01:00, the window [10:00-10:01) closes and emits 'ad_123 had 47 clicks / 1,200 impressions = 3.9% CTR.' A click event timestamped 10:00:58 that arrives at 10:01:03 (3 seconds late) is still assigned to the [10:00-10:01) window because event-time windowing uses the embedded timestamp, not the arrival time.

Real-World Examples

Uber

Uber uses session windows to detect driver trips. A trip session starts when the driver picks up a rider (first GPS event) and ends after 5 minutes of no movement (session gap). Within each session, Uber computes trip distance, duration, average speed, and route efficiency. Session windows naturally handle stops at traffic lights (short gaps within the session) while detecting trip completion (extended gap).

Netflix

Netflix uses hopping windows for real-time quality-of-experience monitoring. A 5-minute window hopping every 30 seconds computes average buffering ratio, bitrate, and error rate per title per region. The overlapping windows provide smooth trend lines on the monitoring dashboard, updating every 30 seconds with 5 minutes of context.

Twitter/X

Twitter uses tumbling windows for trending topic detection. A 1-minute tumbling window counts tweet volume per hashtag. The top-N hashtags by volume across the last window, with a velocity filter (comparing against the previous window), identify trending topics. The simplicity of tumbling windows enables processing millions of tweets per minute with predictable resource usage.

Trade-Offs
AspectDescription
Accuracy vs Latency (Watermark Strategy)Aggressive watermarks (advance quickly) emit results sooner but may miss late events. Conservative watermarks (advance slowly) capture more events but delay results. The allowed lateness parameter controls this trade-off: higher lateness means more complete results but more memory usage to keep windows open.
Window Size vs Resource UsageLarger windows consume more memory (more events buffered). Hopping windows multiply this by size/hop. Session windows with long gaps can grow very large. Operators must provision sufficient state storage (RocksDB for Flink, in-memory for Kafka Streams).
Event Time vs Processing Time ComplexityProcessing-time windows are trivial to implement (wall clock) but incorrect with out-of-order events. Event-time windows are correct but require watermark generation, state management for open windows, and late event handling. Most production systems use event time because correctness trumps simplicity.
Session Window Merging ComplexitySession windows require merging: when a new event arrives that falls within the gap of an existing session, the sessions merge. This is computationally expensive and creates variable-size state. Flink and Kafka Streams support session window merging, but it adds overhead compared to fixed windows.
Case Study

Spotify's Real-Time Listening Analytics

Scenario

Spotify needed to compute real-time listening statistics: plays per song per minute, average listen duration per album, and user session length. With billions of play events daily and events arriving from mobile devices with variable network latency (seconds to minutes of delay), processing-time windowing produced inaccurate metrics. A song play at 10:00 arriving at 10:03 would be counted in the 10:03 window instead of the 10:00 window.

Solution

Spotify migrated to event-time windowing using Apache Beam on Google Cloud Dataflow. Play events carry an embedded timestamp from the client device. Tumbling windows (1-minute) compute plays-per-song with event time. Session windows (30-minute gap) compute listening sessions per user. Watermarks are generated from the stream's observed event times with a 2-minute allowed lateness. Late events trigger window refires that update the aggregation.

Outcome

Metric accuracy improved from ~94% (processing time) to 99.7% (event time with 2-minute lateness). Real-time dashboards show plays-per-song within 3 minutes of the event (watermark delay + processing). Session analytics accurately capture listening behavior: average session length, session frequency, and genre transitions. The 2-minute allowed lateness captures 99.7% of late events; the remaining 0.3% (>2 minutes late, typically from reconnecting offline devices) are routed to a side output for batch reconciliation.

Common Mistakes
  • Using processing time when event time is needed. If events arrive out of order (which they always do in distributed systems), processing-time windows assign events to wrong windows. Use event time with watermarks for correctness.
  • Setting allowed lateness to zero. With zero lateness, any late event is dropped. Even well-connected systems have sub-second delays. Set allowed lateness to at least the 99th percentile of event delivery delay.
  • Forgetting to key windows by the entity that matters. A tumbling window without a key aggregates globally (all users, all ads). Always key by the entity: ad_id for per-ad CTR, user_id for per-user sessions, region for per-region metrics.
  • Using session windows when tumbling windows suffice. Session windows are more complex (merging, variable state size) and harder to reason about. If your use case is 'events per minute,' use tumbling windows. Reserve session windows for capturing natural activity patterns.
Related Concepts

See Windowing in Stream Processing in action

Explore system design templates that use windowing in stream processing and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Compare tumbling vs sliding windows for click aggregation

Metrics to watch
window_completeness_pctlate_event_rateaggregation_latency_msthroughput_rps
Run Simulation
Test Your Understanding

1What is the key difference between tumbling and hopping windows?

2Why are watermarks needed for event-time windowing?

Deeper Reading