Stream Processor
ComputeReal-time event processing engine that handles continuous data streams with windowing, state management, and exactly-once semantics.
Overview
A Stream Processor is a compute component that continuously consumes events from one or more data streams, performs transformations, aggregations, or pattern detection, and outputs results to downstream systems — all in real time. Unlike batch processing which operates on bounded datasets at scheduled intervals, stream processing handles unbounded, continuous data flows with low-latency results. This is the engine behind real-time analytics, fraud detection, IoT telemetry processing, and event-driven microservices architectures.
Windowing is the fundamental concept in stream processing. Since streams are infinite, you need a way to scope aggregations. Tumbling windows are fixed-duration, non-overlapping intervals (count events per minute). Sliding windows overlap and advance by a configurable step (count events in the last 5 minutes, updated every 30 seconds). Session windows group events by activity — a window starts with the first event and closes after a period of inactivity. Vetora models all three windowing strategies and shows how window size affects result latency and accuracy.
Exactly-once semantics is the holy grail of stream processing — guaranteeing that each event is processed exactly one time, with no duplicates and no losses, even in the face of failures. This is achieved through a combination of idempotent producers, transactional consumers, and checkpoint-based recovery. When a stream processor fails, it restarts from the last committed checkpoint and reprocesses any events since that point. Exactly-once is expensive — it requires coordination between the stream, the processor, and the sink — so many systems opt for at-least-once semantics with application-level deduplication.
State management enables complex stream processing operations like joins, aggregations, and pattern detection. Stateful processors maintain local state (e.g., a running count of events per user) that must survive failures and be redistributable when the processor scales. Kafka Streams uses RocksDB for local state with changelog topics for durability. Flink uses a pluggable state backend (memory, RocksDB) with distributed snapshots. Vetora models state size growth and shows how checkpointing frequency trades off recovery time versus processing overhead.
Stream processing architectures often follow the Kappa architecture — a single processing layer that handles both real-time and historical data by replaying events from the stream. This simplifies the architecture compared to the Lambda architecture (separate batch and stream layers) by treating the event log as the source of truth and deriving all views through stream processing.
When to Use
Recommended
- +Real-time analytics dashboards showing live metrics, aggregations, and trend detection
- +Fraud detection systems that analyze transaction patterns and flag anomalies within seconds
- +IoT telemetry processing aggregating sensor data, detecting threshold breaches, and triggering alerts
- +Event-driven architectures where services react to domain events in real time (order placed, payment received)
- +Change Data Capture (CDC) pipelines that propagate database changes to search indices, caches, or data lakes
Not Recommended
- -Simple task-based async processing — use Worker with a job queue for independent, discrete tasks
- -Batch analytics on bounded datasets that run on a schedule — batch processing frameworks are simpler and cheaper
- -Request-response APIs — stream processing is for continuous flows, not individual request handling
Key Parameters in Vetora
Real-World Examples
Apache Kafka Streams
Stream processing library built on Kafka. Runs as a standard Java application (no separate cluster), uses RocksDB for local state, and supports exactly-once semantics through Kafka transactions.
Apache Flink
Distributed stream processing framework with true event-time processing, exactly-once state consistency, and sub-second latency. Used by Alibaba, Uber, and Netflix for large-scale real-time pipelines.
Apache Spark Streaming
Micro-batch stream processing that processes data in small, scheduled batches (100ms–seconds). Part of the Spark ecosystem, enabling seamless transitions between batch and stream processing.
Frequently Asked Questions
What is stream processing in system design?
Stream processing is the continuous, real-time processing of unbounded data flows. Unlike batch processing which operates on finite datasets at scheduled intervals, stream processing handles events as they arrive with low-latency results. Common operations include filtering, transformation, aggregation (count events per window), joins (correlate events from multiple streams), and pattern detection (identify sequences of events). It powers real-time analytics, fraud detection, IoT telemetry, and event-driven microservices.
What is the difference between tumbling, sliding, and session windows?
Tumbling windows are fixed-duration, non-overlapping intervals — every event belongs to exactly one window (e.g., count per minute). Sliding windows have a fixed duration but advance by a configurable step, creating overlapping windows — events can belong to multiple windows (e.g., count in the last 5 minutes, updated every 30 seconds). Session windows are data-driven — a window opens with the first event and closes after a configurable inactivity gap, grouping bursts of activity (e.g., a user's browsing session ends after 30 minutes of inactivity).
What does exactly-once semantics mean in stream processing?
Exactly-once semantics guarantees that each event is processed exactly one time with no duplicates and no losses, even when failures occur. It requires coordinated checkpointing — the processor periodically snapshots its position in the input stream and its accumulated state. On failure, it restarts from the last checkpoint. Achieving exactly-once end-to-end requires idempotent sinks or transactional output (Kafka transactions). It is more expensive than at-least-once due to coordination overhead, adding 10–30% latency.
How does state management work in stream processors?
Stateful stream processors maintain local state (counters, aggregations, lookup tables) that persists across events. For durability, state is backed by a changelog (Kafka Streams logs state changes to a Kafka topic) or periodic snapshots (Flink creates distributed snapshots). When a processor fails and restarts, it rebuilds state from the changelog or snapshot. When scaling out, state is redistributed across new processor instances using key-based partitioning. State is typically stored in embedded databases like RocksDB for fast local access.
When should you use Kafka Streams vs. Flink?
Use Kafka Streams for lightweight stream processing that runs as a standard application (no cluster to manage), is tightly integrated with Kafka, and handles moderate throughput. Use Apache Flink for complex event processing with advanced windowing, event-time handling, and exactly-once semantics at very high throughput (millions of events per second). Flink requires a separate cluster but provides more sophisticated state management, richer windowing primitives, and better support for out-of-order event handling.
Related Components
Durable message streaming platform for pub/sub, event sourcing, and asynchronous communication betwe...
Background processor that handles asynchronous tasks from job queues, supporting retries, dead lette...
Persistent data store supporting SQL or NoSQL models with ACID transactions, replication, sharding, ...
In-memory data store that accelerates reads by serving frequently accessed data without querying the...
Try Stream Processor in the Simulator
Build architectures with Stream Processor and 13 other component types. Run discrete event simulations and get AI-powered feedback.
Open Playground