Design a distributed job scheduler with leader-elected scheduling, Kafka-backed execution queues, tiered worker pools, and idempotency guarantees for reliable at-scale task execution.
A distributed job scheduler is a foundational infrastructure component that appears in system design interviews at companies building data pipelines, automation platforms, and backend services. The problem requires designing a system that reliably executes tasks at specified times or on recurring cron schedules, scales to thousands of jobs per second, and guarantees that each job runs exactly once even in the presence of failures. This tests a candidate's understanding of leader election, distributed queues, idempotency, and failure recovery.
In production, systems like Airflow, Celery, and Sidekiq power critical business workflows ranging from daily report generation to real-time data pipeline triggers. A large-scale job scheduler at companies like Uber or LinkedIn may coordinate millions of scheduled tasks across thousands of workers. The scheduler must handle both short-lived tasks (sending an email in 100ms) and long-running jobs (generating a multi-hour analytics report) without one type starving the other.
The core challenges include achieving sub-second scheduling precision so that jobs execute within 1 second of their scheduled time, preventing double execution when workers crash and jobs are retried, surviving scheduler node failures without missing scheduled triggers, and providing per-tenant isolation so that one user's workload surge does not delay another user's critical jobs. The scheduler must also maintain a complete execution history so that operators can audit when jobs ran, how long they took, and whether they succeeded or failed.
Candidates are expected to reason about the trade-off between a single-leader scheduler (simple but a potential bottleneck) versus distributed scheduling (scalable but prone to duplicate execution). They should also explain why message queues like Kafka decouple scheduling from execution, and why separate worker pools for short and long jobs prevent head-of-line blocking.
The architecture follows a database-backed queue pattern with a leader-elected scheduler, inspired by production systems like Airflow and Celery. A single JobService handles both the API layer (CRUD for job definitions, status queries, manual triggers) and the scheduling loop. One instance is elected leader via a PostgreSQL advisory lock, and this leader polls the jobs table every second for rows where next_run_at is less than or equal to the current time. Due jobs are pushed to Kafka for reliable, decoupled execution.
The API tier consists of an API Gateway for authentication and per-tenant rate limiting, an Application Load Balancer for traffic distribution, and 6 JobService pods running 80 threads each for 60K sustained RPS capacity. The leader instance runs the scheduler loop in addition to serving API traffic. If the leader dies, another instance acquires the advisory lock within 5-10 seconds and resumes scheduling. The leader creates an execution record with an idempotency key (job_id combined with scheduled_time) before publishing to Kafka, preventing duplicate execution even if the leader fails mid-publish.
The data layer includes PostgreSQL (16 partitions, 2 replicas) as the source of truth for job definitions and execution history, with a time-bucket index on next_run_at for efficient scheduler polling. Redis (2-node ElastiCache cluster) caches recent job execution state with an 80% hit rate, reducing database load for the frequent status-polling queries from monitoring dashboards. The execution queue uses Kafka (32 partitions, 500K msg/sec capacity) to buffer jobs between the scheduler and workers.
Two separate worker pools prevent head-of-line blocking. ShortJobWorker (40 instances) handles tasks under 1 minute such as email sends, webhook calls, and cache invalidations. LongJobWorker (10 instances with higher CPU and memory) handles tasks over 1 minute such as report generation, batch ETL, and database backups. Each worker checks the idempotency key before executing, writes results back to PostgreSQL, and updates the Redis cache. Long workers use heartbeat-based liveness detection so that stalled jobs are re-enqueued for another worker.
Choice
Single leader via PostgreSQL advisory lock
Rationale
A single leader scheduling loop eliminates the split-brain problem where multiple schedulers trigger the same job simultaneously. PostgreSQL advisory locks are simple and reliable. If the leader dies, another instance acquires the lock within seconds. At 10K jobs per second, a single scheduler can handle the polling and enqueue load, making the added complexity of distributed scheduling unnecessary at this scale.
Choice
Kafka instead of direct worker invocation
Rationale
Direct invocation couples scheduler availability to worker availability. If workers are slow or down, the scheduler blocks and stops processing due jobs. Kafka decouples scheduling from execution, buffering due jobs reliably with at-least-once delivery. Workers consume at their own pace with natural backpressure. Kafka also provides message replay for debugging failed executions and supports partitioning by job_id for ordering within a single job.
Choice
Separate short-job and long-job worker pools
Rationale
A 30-minute report generation job sitting in the same queue as a 100ms email send causes head-of-line blocking, where the quick task waits behind the slow one. Separate pools ensure that short tasks maintain low latency while long tasks get dedicated resources. This is the same tier-based worker pattern used by Sidekiq and Celery in production, and it allows independent scaling of each pool based on workload characteristics.
Choice
Composite key of job_id plus scheduled_time checked before execution
Rationale
Network partitions, worker crashes, and Kafka redelivery can cause the same job to be consumed by a worker twice. The idempotency key is checked before execution begins. If an execution record with that key already exists, the duplicate is skipped. This guarantees at-most-once execution semantics, which is critical for jobs with side effects like sending emails or transferring funds.
Target RPS
20K RPS peak API traffic; 10K jobs/sec peak execution throughput
Latency (p99)
p99 < 50ms for job CRUD; < 1s scheduling precision; < 10ms status queries from cache
Storage
PostgreSQL with 16 partitions for job definitions and execution history; 6 GB Redis cache for hot state
Availability
99.9% with < 10s leader failover gap; Kafka buffers jobs during worker outages
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
When the leader dies, the PostgreSQL advisory lock is released (either explicitly on graceful shutdown or automatically when the database connection drops). Other JobService instances attempt to acquire the lock on their next heartbeat cycle. A new leader is typically elected within 5-10 seconds. During this gap, no new jobs are scheduled from the database, but jobs already enqueued in Kafka continue executing on workers normally. The gap means some jobs may fire a few seconds late, but none are missed because the next leader picks up all overdue jobs on its first poll cycle.
Every execution gets a unique idempotency key composed of the job_id and the scheduled_time. The scheduler writes an execution record with this key to the database before publishing to Kafka. When a worker receives a job, it checks whether an execution record with that idempotency key already exists. If it does, the worker skips execution. This handles Kafka redelivery after worker crashes, scheduler retries after publish failures, and any other scenario that could cause duplicate consumption. The result is at-most-once execution semantics for each scheduled trigger.
Kafka does not natively support delayed delivery (deliver this message at time T), and SQS has a maximum delay of 15 minutes. Polling the database every second using a time-bucket index (WHERE next_run_at <= now) gives sub-second scheduling precision for all time horizons, from 1 second to 1 year in the future. The polling query only scans rows that are currently due, so performance scales with the number of due jobs per second rather than the total number of scheduled jobs. With a well-indexed column, this query executes in under 5ms even with millions of job definitions.
A common pattern is many cron jobs scheduled at midnight or the top of each hour. The scheduler polls once per second and enqueues all due jobs to Kafka in batch. Kafka's 500K messages per second capacity and 32 partitions absorb large bursts without backpressure reaching the scheduler. The worker pools then process the burst at their maximum throughput (40 short workers and 10 long workers). If the burst exceeds worker capacity, jobs queue in Kafka and are processed with a delay rather than being dropped. Per-tenant rate limiting at the API Gateway prevents any single tenant from monopolizing the execution pipeline.
This template supports independent jobs only. There is no mechanism for defining that Job B should run only after Job A completes, or for constructing directed acyclic graph (DAG) workflows. Adding DAG support requires a workflow engine that tracks dependency state, manages partial completion, and handles fan-in and fan-out patterns. Systems like Apache Airflow and Temporal provide DAG execution as a core feature. For the common case of independent scheduled tasks, the simpler single-job model avoids the complexity of dependency tracking and deadlock detection.
Sign in to join the discussion.
Ready to design your own Job Scheduler?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator