Stream Processor

Compute

Real-time event processing engine that handles continuous data streams with windowing, state management, and exactly-once semantics.

Overview

A Stream Processor is a compute component that continuously consumes events from one or more data streams, performs transformations, aggregations, or pattern detection, and outputs results to downstream systems — all in real time. Unlike batch processing which operates on bounded datasets at scheduled intervals, stream processing handles unbounded, continuous data flows with low-latency results. This is the engine behind real-time analytics, fraud detection, IoT telemetry processing, and event-driven microservices architectures.

Windowing is the fundamental concept in stream processing. Since streams are infinite, you need a way to scope aggregations. Tumbling windows are fixed-duration, non-overlapping intervals (count events per minute). Sliding windows overlap and advance by a configurable step (count events in the last 5 minutes, updated every 30 seconds). Session windows group events by activity — a window starts with the first event and closes after a period of inactivity. Vetora models all three windowing strategies and shows how window size affects result latency and accuracy.

Exactly-once semantics is the holy grail of stream processing — guaranteeing that each event is processed exactly one time, with no duplicates and no losses, even in the face of failures. This is achieved through a combination of idempotent producers, transactional consumers, and checkpoint-based recovery. When a stream processor fails, it restarts from the last committed checkpoint and reprocesses any events since that point. Exactly-once is expensive — it requires coordination between the stream, the processor, and the sink — so many systems opt for at-least-once semantics with application-level deduplication.

State management enables complex stream processing operations like joins, aggregations, and pattern detection. Stateful processors maintain local state (e.g., a running count of events per user) that must survive failures and be redistributable when the processor scales. Kafka Streams uses RocksDB for local state with changelog topics for durability. Flink uses a pluggable state backend (memory, RocksDB) with distributed snapshots. Vetora models state size growth and shows how checkpointing frequency trades off recovery time versus processing overhead.

Stream processing architectures often follow the Kappa architecture — a single processing layer that handles both real-time and historical data by replaying events from the stream. This simplifies the architecture compared to the Lambda architecture (separate batch and stream layers) by treating the event log as the source of truth and deriving all views through stream processing.

When to Use

+Real-time analytics dashboards showing live metrics, aggregations, and trend detection
+Fraud detection systems that analyze transaction patterns and flag anomalies within seconds
+IoT telemetry processing aggregating sensor data, detecting threshold breaches, and triggering alerts
+Event-driven architectures where services react to domain events in real time (order placed, payment received)
+Change Data Capture (CDC) pipelines that propagate database changes to search indices, caches, or data lakes

Not Recommended

-Simple task-based async processing — use Worker with a job queue for independent, discrete tasks
-Batch analytics on bounded datasets that run on a schedule — batch processing frameworks are simpler and cheaper
-Request-response APIs — stream processing is for continuous flows, not individual request handling

Key Parameters in Vetora

Parameter	Description	Typical Values
parallelism	Number of parallel processing tasks. Typically matches the number of input partitions for maximum throughput.	4–64 parallel tasks
windowSizeMs	Duration of the processing window for aggregation operations (tumbling or sliding).	10,000–300,000ms (10 seconds to 5 minutes)
checkpointIntervalMs	How frequently the processor snapshots its state for failure recovery. Lower intervals reduce data loss but increase overhead.	10,000–60,000ms (10–60 seconds)
processingLatencyMs	Time from event ingestion to output production. Represents end-to-end processing delay.	10–1,000ms depending on complexity

Real-World Examples

Apache Kafka Streams

Stream processing library built on Kafka. Runs as a standard Java application (no separate cluster), uses RocksDB for local state, and supports exactly-once semantics through Kafka transactions.

Apache Flink

Distributed stream processing framework with true event-time processing, exactly-once state consistency, and sub-second latency. Used by Alibaba, Uber, and Netflix for large-scale real-time pipelines.

Apache Spark Streaming

Micro-batch stream processing that processes data in small, scheduled batches (100ms–seconds). Part of the Spark ecosystem, enabling seamless transitions between batch and stream processing.

Frequently Asked Questions

What is stream processing in system design?

Stream processing is the continuous, real-time processing of unbounded data flows. Unlike batch processing which operates on finite datasets at scheduled intervals, stream processing handles events as they arrive with low-latency results. Common operations include filtering, transformation, aggregation (count events per window), joins (correlate events from multiple streams), and pattern detection (identify sequences of events). It powers real-time analytics, fraud detection, IoT telemetry, and event-driven microservices.

What is the difference between tumbling, sliding, and session windows?

Tumbling windows are fixed-duration, non-overlapping intervals — every event belongs to exactly one window (e.g., count per minute). Sliding windows have a fixed duration but advance by a configurable step, creating overlapping windows — events can belong to multiple windows (e.g., count in the last 5 minutes, updated every 30 seconds). Session windows are data-driven — a window opens with the first event and closes after a configurable inactivity gap, grouping bursts of activity (e.g., a user's browsing session ends after 30 minutes of inactivity).

What does exactly-once semantics mean in stream processing?

Exactly-once semantics guarantees that each event is processed exactly one time with no duplicates and no losses, even when failures occur. It requires coordinated checkpointing — the processor periodically snapshots its position in the input stream and its accumulated state. On failure, it restarts from the last checkpoint. Achieving exactly-once end-to-end requires idempotent sinks or transactional output (Kafka transactions). It is more expensive than at-least-once due to coordination overhead, adding 10–30% latency.

How does state management work in stream processors?

Stateful stream processors maintain local state (counters, aggregations, lookup tables) that persists across events. For durability, state is backed by a changelog (Kafka Streams logs state changes to a Kafka topic) or periodic snapshots (Flink creates distributed snapshots). When a processor fails and restarts, it rebuilds state from the changelog or snapshot. When scaling out, state is redistributed across new processor instances using key-based partitioning. State is typically stored in embedded databases like RocksDB for fast local access.

When should you use Kafka Streams vs. Flink?

Use Kafka Streams for lightweight stream processing that runs as a standard application (no cluster to manage), is tightly integrated with Kafka, and handles moderate throughput. Use Apache Flink for complex event processing with advanced windowing, event-time handling, and exactly-once semantics at very high throughput (millions of events per second). Flink requires a separate cluster but provides more sophisticated state management, richer windowing primitives, and better support for out-of-order event handling.

Related Components

Event StreamStorage

Durable message streaming platform for pub/sub, event sourcing, and asynchronous communication betwe...

WorkerCompute

Background processor that handles asynchronous tasks from job queues, supporting retries, dead lette...

DatabaseStorage

Persistent data store supporting SQL or NoSQL models with ACID transactions, replication, sharding, ...

CacheStorage

In-memory data store that accelerates reads by serving frequently accessed data without querying the...

Try Stream Processor in the Simulator

Build architectures with Stream Processor and 13 other component types. Run discrete event simulations and get AI-powered feedback.

Open Playground