1What determines which partition a Kafka record is written to?
Apache Kafka is a distributed, durable, high-throughput event streaming platform built around an append-only commit log. Its architecture of brokers, topics, partitions, and consumer groups enables millions of events per second with strong ordering guarantees per partition, configurable durability, and the ability to replay the full event history.
Apache Kafka was created at LinkedIn in 2010 to solve the problem of connecting all systems in a large organization to all data streams in real time. It was open-sourced in 2011 and became an Apache top-level project in 2012. Today it is the de facto standard for event streaming at scale.
Kafka's core abstraction is the **commit log**: an append-only, ordered, immutable sequence of records. A **topic** is a named log, and each topic is divided into one or more **partitions**. Each partition is an independent log stored on a single broker, with records assigned sequential offsets starting from 0. Partitions are the unit of parallelism: more partitions means more consumers can process data concurrently.
**Producers** write records to topics. Each record has an optional key; records with the same key are routed to the same partition (via hash(key) % num_partitions), ensuring ordering for related events. Records without a key are distributed round-robin across partitions.
**Consumer groups** coordinate consumption. Each partition in a topic is assigned to exactly one consumer within a group. If you have 12 partitions and 4 consumers in a group, each consumer handles 3 partitions. Adding a 5th consumer triggers rebalancing. If a consumer fails, its partitions are reassigned to surviving members. Multiple consumer groups can read the same topic independently, each maintaining its own offset -- this is how Kafka provides both fan-out and parallel processing.
**Brokers** store partitions on disk. Each partition has a leader broker (handles all reads and writes) and zero or more follower replicas on other brokers. The replication factor (typically 3) determines how many copies exist. Followers pull from the leader; once a follower is caught up, it joins the in-sync replica set (ISR). Producers can choose acknowledgment levels: acks=0 (fire-and-forget), acks=1 (leader only), or acks=all (all ISR replicas). With acks=all and min.insync.replicas=2, no acknowledged message is lost even if one broker dies.
Kafka's storage model is unique: records are retained for a configurable period (hours, days, or forever) regardless of whether they have been consumed. This enables replay: a new consumer group can start from offset 0 and reprocess the entire history, or a failing consumer can reset its offset and reprocess missed records. This replay capability is what distinguishes Kafka from traditional message queues.
Tracking Page Views
A website tracks page views by publishing each view to a Kafka topic 'page-views' with the user ID as the partition key. With 12 partitions, views for the same user always land in the same partition, preserving chronological order per user. Three consumer groups subscribe: (1) a real-time dashboard group with 4 consumers aggregating views per second, (2) an analytics group with 12 consumers (one per partition) writing to a data warehouse, and (3) a recommendations group with 6 consumers updating user interest profiles. Each group processes every view independently, at its own pace.
LinkedIn runs over 100 Kafka clusters processing 7+ trillion messages per day across 7+ petabytes of data. Every user action (profile view, connection, post, message) flows through Kafka. The activity data feeds real-time notifications, the news feed, analytics, ad targeting, and anti-abuse systems. Each system runs as an independent consumer group.
Netflix
Netflix uses Kafka as the backbone of its data pipeline, processing over 1 trillion events per day. Every streaming session, UI interaction, and playback quality metric flows through Kafka. The data feeds real-time A/B testing, content recommendations, infrastructure monitoring, and billing. Netflix contributed the open-source Kafka consumer rebalance protocol improvements.
Walmart
Walmart uses Kafka for real-time inventory tracking across 4,700+ stores. When an item is sold at a register, scanned in a warehouse, or restocked on a shelf, the event is published to Kafka. Consumer groups power real-time inventory counts, replenishment triggers, pricing updates, and the online pickup-availability feature that shows customers what is in stock at their local store.
| Aspect | Description |
|---|---|
| Throughput vs Latency | Kafka achieves massive throughput by batching records in the producer (linger.ms, batch.size) and writing sequentially to disk. Batching increases throughput but adds latency. For sub-millisecond latency requirements, a traditional message broker (RabbitMQ, Redis Streams) may be more appropriate. Kafka's sweet spot is high throughput with latency in the 5-50ms range. |
| Partition Count vs Operational Complexity | More partitions means more parallelism but also more file handles, longer leader elections, and slower consumer group rebalancing. LinkedIn recommends ā¤4,000 partitions per broker and ā¤200,000 per cluster. Over-partitioning is a common mistake that increases end-to-end latency during rebalances. |
| Retention Duration vs Storage Cost | Long retention (days or weeks) enables replay and reprocessing but requires significant disk space. At 1 million messages/sec at 1KB each, one day of retention is ~86 GB per partition replica. Tiered storage (Kafka 3.6+) offloads old segments to object storage (S3), keeping hot data on broker SSDs and cold data in cheap storage. |
| Ordering vs Parallelism | Kafka guarantees ordering only within a partition. If you need total ordering across all events, you need a single partition -- which limits you to one consumer per group. The compromise is to partition by entity key (user_id, order_id), giving per-entity ordering with multi-partition parallelism. |
The LinkedIn Log: From Database Synchronization to Event Streaming Platform
Scenario
By 2010, LinkedIn had dozens of data systems (search index, social graph, recommendation engine, Hadoop warehouse) that all needed copies of the same data. Point-to-point pipelines created an O(N²) integration problem: every new system needed a pipeline from every data source. Changes to one system's schema would break multiple downstream consumers.
Solution
Jay Kreps, Neha Narkhede, and Jun Rao designed Kafka as a unified commit log that all systems could publish to and consume from. Each data source publishes change events to Kafka topics. Each consuming system runs a consumer group that reads from the relevant topics at its own pace. The log serves as the single source of truth, decoupling producers from consumers. LinkedIn published this concept as 'The Log' in a 2013 blog post that became one of the most influential distributed systems articles.
Outcome
LinkedIn reduced its integration complexity from O(N²) to O(N): each system connects to Kafka once. New systems are onboarded by creating a consumer group -- no changes to producers. Kafka grew from an internal tool to the most widely adopted event streaming platform, powering over 80% of Fortune 100 companies. The project spawned Confluent (valued at over $8 billion) and influenced the design of AWS Kinesis, Azure Event Hubs, and Google Pub/Sub.
See Kafka Architecture in action
Explore system design templates that use kafka architecture and run traffic simulations to see how these concepts perform under real load.
Browse Templates1What determines which partition a Kafka record is written to?
2What happens if a Kafka consumer group has more consumers than partitions?