Production-grade distributed job scheduler with N independent scheduler shards using in-memory time-wheels for sub-second scheduling precision. Sharded JobDB by hash(job_id), separate worker pools for short and long tasks, dead-letter queue for permanently failed jobs, DAG support for multi-step workflows, and idempotency via unique execution IDs.
The sharded time-wheel approach to job scheduling represents the production architecture used by FAANG-scale scheduling systems like Google Cloud Scheduler, AWS Step Functions, and Temporal's internal task distribution. It solves the two fundamental limitations of the leader-elected approach (V1): the single-leader polling bottleneck and the lack of DAG workflow support.
The key insight is replacing database polling with an in-memory time-wheel. A time-wheel is a circular buffer of slots, each representing one second. When a job is created, it is inserted into the slot corresponding to its next_run_at time. Each second, the wheel advances by one slot and fires all jobs in the current slot — no database query needed. This eliminates the constant polling load that grows with job count. At 1M active jobs per shard, the time-wheel fires jobs from the current slot in O(K) where K is the number of due jobs (typically 100-1000), compared to V1's polling query which scans the entire time-bucket index (O(log N + K)).
Sharding distributes the scheduling load across N independent scheduler instances. Each shard owns a partition of jobs determined by hash(job_id) mod N. With 3 shards, each shard manages approximately 333K jobs out of 1M total. If one shard fails, only 1/3 of jobs are affected — the other 2/3 continue scheduling without interruption. The failed shard's partition is reassigned to a healthy shard within seconds via a coordination service (etcd or ZooKeeper). This is a critical improvement over V1 where a leader failure halts all scheduling for 5-10 seconds.
DAG (Directed Acyclic Graph) support enables multi-step workflows where the completion of one job triggers the next. The job_dependencies table stores parent-child relationships. When ResultProcessor sees a parent job complete successfully, it queries for children whose all parents have completed (topological readiness check) and triggers them by publishing to JobQueue. This is the same dependency resolution algorithm used by Apache Airflow, Temporal Workflows, and Prefect. Cycle detection is enforced at job creation time — attempting to create a circular dependency returns a 400 error.
The dead-letter queue (DLQ) captures permanently failed jobs (those that exceeded max retries) with full context: execution_id, job definition, error message, retry history. AlertService monitors DLQ depth and triggers PagerDuty when failures exceed thresholds. This replaces V0's silent failure mode where failed jobs were marked in the database and forgotten.
Worker pools are separated by estimated task duration. WorkerPoolShort (60 pods) handles tasks under 1 minute with high parallelism. WorkerPoolLong (20 pods) handles tasks over 1 minute with more CPU and memory per pod. Heartbeat-based liveness detection on long workers ensures that stuck tasks are detected and re-enqueued within 60 seconds.
The primary trade-offs are operational complexity (15 components including 3 scheduler shards, 2 Kafka clusters, Redis, PostgreSQL, ResultProcessor, DLQ, and AlertService) and time-wheel cold-start latency (5-10 seconds to rebuild the wheel from JobDB on shard restart). These are justified at scale exceeding 100K active jobs where simpler architectures hit their throughput ceiling.
Interviewers expect candidates to explain the time-wheel data structure, discuss shard ownership and failover, reason about DAG dependency resolution (topological sort), and analyze the DLQ pattern for failure observability.
The sharded time-wheel architecture uses fifteen components organized into five layers: traffic ingestion (SchedulerClient, ApiGateway, SchedulerLB), CRUD API (SchedulerAPI), scheduling (SchedulerShardA, SchedulerShardB, SchedulerShardC), data stores (JobCache/Redis, JobDB/PostgreSQL), execution (JobQueue/Kafka, WorkerPoolShort, WorkerPoolLong), and result processing (ResultStream/Kafka, ResultProcessor, DLQ/SQS, AlertService).
The CRUD path is fully separated from the scheduling path. SchedulerAPI (8 pods) handles all external requests: job creation, status queries, deletion, pause/resume, manual execution, and DAG dependency management. It writes to JobDB (sharded by hash(job_id)) and caches state in JobCache (Redis, 85% hit rate). On job creation, SchedulerAPI notifies the owning scheduler shard to load the new job into its time-wheel. This notification is a lightweight internal RPC that adds approximately 2ms to the creation path.
The scheduling path operates independently of the CRUD path. Three scheduler shards (A, B, C) each maintain an in-memory time-wheel with 3600 slots (1-second resolution, 1-hour horizon). Shard A owns jobs where hash(job_id) mod 3 == 0, Shard B owns mod 3 == 1, Shard C owns mod 3 == 2. On startup, each shard queries its partition from JobDB to populate the wheel (cold start: 5-10 seconds for 333K jobs). During steady state, no DB queries are needed — the wheel fires jobs from the current slot. Each second, the shard advances the wheel and publishes due jobs to JobQueue (Kafka) with unique execution_ids.
JobQueue (Amazon MSK, 64 partitions, 1M msg/sec capacity) distributes execution events to two worker pools. WorkerPoolShort (60 pods, 2 vCPU / 4 GB) handles tasks under 1 minute with high parallelism. WorkerPoolLong (20 pods, 4 vCPU / 8 GB) handles heavy tasks with heartbeat-based liveness detection. Both pools check execution_id idempotency before running. On completion, workers publish results to ResultStream (a separate Kafka cluster).
ResultProcessor (15 workers) consumes from ResultStream and performs three actions: (1) updates JobDB with execution status and duration, (2) for DAG jobs, checks if all parent dependencies are satisfied and triggers children by publishing to JobQueue, (3) for permanently failed jobs (retry_count >= max_retries), routes the execution record to DLQ (SQS) and triggers AlertService.
DLQ (Amazon SQS) stores permanently failed jobs with full context for manual inspection. Messages are retained for 14 days. AlertService (2 pods) polls DLQ depth every 10 seconds and triggers PagerDuty when the count exceeds 100 messages or when a scheduler shard stops heartbeating.
Horizontal scaling: add scheduler shards to increase scheduling throughput (linear scaling). Add workers to increase execution throughput. Add Kafka partitions for higher message throughput. Shard rebalancing is managed by the coordination service — adding a 4th shard redistributes 25% of jobs from each existing shard to the new one via consistent hashing.
This sequence diagram traces four critical flows: job creation with time-wheel insertion, time-wheel tick firing, DAG child triggering via ResultProcessor, and dead-letter routing for permanently failed jobs.
The key insight is that the time-wheel replaces DB polling entirely. Once a job is loaded into the wheel (on creation or cold start), it fires from memory with zero DB I/O. The second insight is the ResultProcessor's dual role: it updates execution status and also handles DAG dependency resolution — checking if all parents are satisfied before triggering children.
Step-by-Step Walkthrough
Pseudocode
// TIME-WHEEL STRUCTURE
class TimeWheel:
slots: Array[3600] // 1 slot per second, 1-hour horizon
pointer: int = 0 // Current slot index
function insert(job, next_run_at):
slot_index = next_run_at.unix_seconds % 3600
slots[slot_index].add(job)
function tick():
pointer = (pointer + 1) % 3600
due_jobs = slots[pointer] // O(K) — K jobs in this slot
slots[pointer] = [] // Clear the slot
events = []
for job in due_jobs:
exec_id = hash(job.job_id + current_time)
events.push({ execution_id: exec_id, job_id: job.job_id, task: job.task })
// Reinsert recurring jobs into future slot
if job.schedule.startsWith("cron:"):
next = compute_next_cron(job.schedule)
insert(job, next)
kafka.publishBatch("job-execution", events) // ~4ms for 1000 events
// DAG DEPENDENCY RESOLUTION (in ResultProcessor)
async function processResult(result):
await db.execute("UPDATE job_executions SET status=$1 WHERE execution_id=$2",
[result.status, result.execution_id])
if result.status == "completed":
// Mark parent dependency as satisfied
await db.execute(
"UPDATE job_dependencies SET status='satisfied' WHERE parent_job_id=$1",
[result.job_id])
// Find children whose ALL parents are now satisfied
ready_children = await db.query(
"SELECT child_job_id FROM job_dependencies " +
"WHERE parent_job_id = $1 AND status = 'satisfied' " +
"AND NOT EXISTS (SELECT 1 FROM job_dependencies " +
"WHERE child_job_id = job_dependencies.child_job_id AND status != 'satisfied')",
[result.job_id])
for child_id in ready_children:
child_job = await db.query("SELECT * FROM jobs WHERE job_id = $1", [child_id])
exec_id = hash(child_id + current_time)
await kafka.publish("job-execution", {
execution_id: exec_id, job_id: child_id,
task: child_job.task, params: child_job.params
})
else if result.status == "failed" and result.retry_count >= MAX_RETRIES:
await db.execute("UPDATE job_executions SET status='dead-lettered' WHERE execution_id=$1",
[result.execution_id])
await sqs.sendMessage(DLQ_URL, { execution_id: result.execution_id, error: result.error_message })
await alertService.trigger("job_permanently_failed", result)The V2 schema adds two key tables beyond V1: job_dependencies for DAG workflows and expanded job_executions with dead-letter status tracking. The jobs table gains shard_id for partition ownership and estimated_duration for worker pool routing. The three tables together support the full lifecycle: job definition (jobs), execution tracking with idempotency (job_executions), and DAG dependency resolution (job_dependencies).
The jobs table is sharded by hash(job_id) across 3 partitions matching the scheduler shards. The shard_id column determines which scheduler shard owns the job and loads it into its time-wheel. The job_executions table uses the same deterministic execution_id as V1 for idempotency, with the addition of dead-lettered status. The job_dependencies table stores directed edges in the DAG with a unique constraint preventing duplicate edges.
Step-by-Step Walkthrough
Pseudocode
-- JOBS TABLE: Sharded by hash(job_id), time-wheel source
CREATE TABLE jobs (
job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
schedule TEXT NOT NULL,
task TEXT NOT NULL,
params JSONB,
next_run_at TIMESTAMPTZ NOT NULL,
status TEXT NOT NULL DEFAULT 'active',
estimated_duration TEXT NOT NULL DEFAULT 'short',
shard_id INTEGER NOT NULL, -- hash(job_id) % 3
tenant_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_jobs_shard ON jobs (shard_id, next_run_at)
WHERE status = 'active'; -- Cold-start wheel rebuild
-- JOB_EXECUTIONS TABLE: Idempotent + dead-letter tracking
CREATE TABLE job_executions (
execution_id UUID PRIMARY KEY, -- hash(job_id + scheduled_time)
job_id UUID NOT NULL REFERENCES jobs(job_id),
status TEXT NOT NULL DEFAULT 'pending',
started_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
duration_ms INTEGER,
retry_count INTEGER NOT NULL DEFAULT 0,
error_message TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_executions_job ON job_executions (job_id, created_at DESC);
CREATE INDEX idx_executions_failed ON job_executions (status)
WHERE status IN ('failed', 'dead-lettered');
-- JOB_DEPENDENCIES TABLE: DAG edges
CREATE TABLE job_dependencies (
dependency_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
parent_job_id UUID NOT NULL REFERENCES jobs(job_id),
child_job_id UUID NOT NULL REFERENCES jobs(job_id),
status TEXT NOT NULL DEFAULT 'pending',
created_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (parent_job_id, child_job_id)
);
CREATE INDEX idx_deps_parent ON job_dependencies (parent_job_id, status);
CREATE INDEX idx_deps_child ON job_dependencies (child_job_id, status);
-- DAG TRIGGER QUERY (runs in ResultProcessor on parent completion)
UPDATE job_dependencies SET status = 'satisfied'
WHERE parent_job_id = $1;
SELECT child_job_id FROM job_dependencies
WHERE parent_job_id = $1
AND NOT EXISTS (
SELECT 1 FROM job_dependencies d2
WHERE d2.child_job_id = job_dependencies.child_job_id
AND d2.status != 'satisfied'
); -- Returns children where ALL parents are now satisfiedChoice
In-memory circular buffer with 3600 slots (1-second resolution) instead of SQL polling
Rationale
The time-wheel eliminates the constant DB polling overhead. In V1, the leader runs SELECT ... WHERE next_run_at <= NOW() every second — a query that scans the time-bucket index even when no jobs are due. With 1M active jobs, this scan takes 50-100ms per cycle. The time-wheel fires in O(K) where K is the number of due jobs in the current slot, with zero DB I/O. The trade-off is memory (approximately 1 GB per shard for 333K jobs) and cold-start latency (5-10 seconds to rebuild from DB on restart).
Choice
N independent scheduler shards, each owning hash(job_id) mod N partition
Rationale
A single leader (V1) bottlenecks at 10K jobs/sec. With 3 shards, each handles approximately 7K jobs/sec, tripling total throughput to 20K+ jobs/sec. More importantly, shard isolation limits the blast radius of failures: if Shard A crashes, only 1/3 of jobs are delayed while Shards B and C continue normally. In V1, a leader crash halts all scheduling. Shard ownership is managed by etcd/ZooKeeper — on shard failure, the orphaned partition is reassigned to a healthy shard within seconds.
Choice
job_dependencies table with topological readiness checks on parent completion
Rationale
Many real-world workflows require job-to-job dependencies: ETL pipeline stages (extract -> transform -> load), CI/CD (build -> test -> deploy), data pipelines (ingest -> validate -> aggregate). Without DAG support (V0 and V1), these workflows require manual orchestration or external tools like Airflow. The ResultProcessor handles dependency resolution: when a parent completes, it queries SELECT child_job_id FROM job_dependencies WHERE parent_job_id = ? AND status = 'satisfied', checks if all parents of each child are complete, and triggers ready children.
Choice
SQS DLQ with AlertService monitoring instead of silent database-only failure
Rationale
In V0, failed jobs are marked in the database and forgotten. In V1, they are logged but not surfaced. The DLQ pattern ensures permanently failed jobs are captured with full context (execution_id, error message, retry history, job definition) for manual inspection and replay. AlertService polls DLQ depth every 10 seconds and triggers PagerDuty when thresholds are exceeded. This is the standard failure observability pattern used by AWS Lambda DLQ, SQS DLQ, and Kafka error topics.
Choice
Dedicated Kafka cluster for job results instead of reusing JobQueue
Rationale
Separating the result stream from the job queue prevents result processing from competing with job dispatch for Kafka throughput. JobQueue handles 20K+ msg/sec of execution events. ResultStream handles 20K+ msg/sec of completion results. Using a single Kafka cluster would require 40K+ msg/sec capacity and careful topic/partition management to prevent interference. Separate clusters also enable independent scaling and failure isolation.
Choice
execution_id = hash(job_id + scheduled_time) with INSERT ... ON CONFLICT
Rationale
Shard failover can cause the same job to be fired by two shards briefly (old shard fires before failover, new shard fires after). Kafka redelivery can cause the same execution event to be consumed twice. The deterministic execution_id ensures that for a given job at a given scheduled time, only one execution can proceed. Workers check via INSERT ... ON CONFLICT (execution_id) DO NOTHING — an atomic, lock-free deduplication mechanism.
Target RPS
~40K peak (6K CRUD + 22K status + 12K execution/DAG)
Latency (p99)
<100ms schedule precision, 50ms CRUD p99, 8ms status (cache hit)
Storage
~500 GB/year (sharded DB + 2 Kafka clusters + DLQ + execution history)
Availability
99.95% (shard isolation, DLQ, multi-AZ, heartbeat monitoring)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Time-wheel tick (fire due jobs from current slot) | O(K) — K jobs in the current slot | O(J/S) — J total jobs, S shard count, stored in the wheel | Each second, the wheel advances and fires all jobs in the current slot. Typical K = 100-1000 jobs per slot per shard. No DB query — pure in-memory operation. The wheel consumes approximately 64 MB per 333K jobs (200 bytes per job slot entry). |
| Cold-start wheel rebuild (SELECT ... WHERE shard_id = ?) | O(J/S) — scan all jobs in the shard's partition | O(J/S) — load all jobs into memory | On shard restart, the full partition must be loaded: SELECT * FROM jobs WHERE shard_id = ? AND status = 'active'. For 333K jobs, this takes 5-10 seconds. During this window, jobs may fire late. A warm standby shard can reduce this gap. |
| DAG readiness check (all parents completed?) | O(P) — P parents of the child job | O(1) — constant per check | For each completed parent, ResultProcessor queries: SELECT COUNT(*) FROM job_dependencies WHERE child_job_id = ? AND status != 'satisfied'. If count == 0, all parents are done and the child is triggered. Typical P = 1-5 parents per job in production DAGs. |
| Cycle detection (DFS on dependency graph) | O(V + E) — V jobs, E dependencies in the connected component | O(V) — visited set | Runs at INSERT time for new dependencies. Starting from child_job_id, DFS follows existing edges to check if it can reach parent_job_id. In practice, DAGs are small (10-50 jobs) so the DFS completes in microseconds. |
Stores job definitions sharded by hash(job_id). Each scheduler shard queries its partition on cold start to populate the time-wheel. Includes estimated_duration for worker pool routing and shard_id for partition ownership.
Indexes: PK on job_id, idx_jobs_shard ON (shard_id, next_run_at) WHERE status = 'active' (cold-start wheel rebuild), idx_jobs_tenant ON (tenant_id, status) (per-tenant job listing)
The shard_id column determines which scheduler shard owns the job. On cold start, the shard queries SELECT * FROM jobs WHERE shard_id = ? AND status = 'active' to populate its time-wheel. During steady state, no DB queries are needed — the wheel fires from memory. shard_id is computed as hash(job_id) mod N at job creation time.
Append-only execution history with unique execution_id for idempotency. Includes retry_count and error_message for failure tracking. Rows with status='dead-lettered' have been routed to the DLQ.
Indexes: PK on execution_id (idempotency check via INSERT ... ON CONFLICT), idx_executions_job ON (job_id, created_at DESC) (per-job history), idx_executions_status ON (status) WHERE status IN ('failed','dead-lettered') (failure dashboard)
The execution_id is deterministic: hash(job_id + scheduled_time). This enables INSERT ... ON CONFLICT (execution_id) DO NOTHING for atomic deduplication. The partial index on failed/dead-lettered status supports the failure dashboard without scanning the entire table.
DAG dependency graph for workflow orchestration. Each row represents a directed edge: parent_job_id must complete before child_job_id can fire. Cycle detection is enforced at write time. ResultProcessor queries this table on parent completion to find ready children.
Indexes: PK on dependency_id, idx_deps_parent ON (parent_job_id, status) (find children to trigger on parent completion), idx_deps_child ON (child_job_id, status) (check if all parents are satisfied for a child), UNIQUE (parent_job_id, child_job_id) (prevent duplicate edges)
Cycle detection runs at INSERT time: a DFS from child_job_id through existing edges checks if it can reach parent_job_id. If so, the INSERT is rejected with a 400 error. On parent completion, ResultProcessor updates status to 'satisfied' and checks if all of the child's parents are satisfied — if so, the child is triggered.
Published by scheduler shards when a job's time-wheel slot fires. Consumed by WorkerPoolShort and WorkerPoolLong based on estimated_duration filtering. Also published by ResultProcessor when triggering DAG children.
Key Schema
job_id (string) — partition key for per-job ordering
Value Schema
{ execution_id: string (idempotency key), job_id: string, task: string, params: object, estimated_duration: string (short|long), shard_id: integer, scheduled_time: string (ISO timestamp) }
Published by workers on task completion (success or failure). Consumed by ResultProcessor for status updates, DAG triggering, and DLQ routing.
Key Schema
job_id (string) — partition key for per-job ordering
Value Schema
{ execution_id: string, job_id: string, status: string (completed|failed), duration_ms: integer, retry_count: integer, error_message: string, output: string (truncated to 10KB) }
Scheduler Shard A crashes (host failure, OOM, network partition)
Impact
Jobs in Shard A's partition (hash(job_id) mod 3 == 0, approximately 333K jobs) stop firing. Shards B and C continue scheduling 2/3 of all jobs without interruption. The coordination service detects the heartbeat timeout within 5 seconds and reassigns the partition. The receiving shard rebuilds its time-wheel for the new partition (5-10 seconds). Total recovery: 10-15 seconds. During this window, approximately 3K jobs (10 seconds x 333 due/sec) may fire 10-15 seconds late.
Mitigation
Warm standby shard: maintain a standby shard that mirrors the primary's time-wheel but does not fire. On primary failure, the standby activates within 1-2 seconds. Alternatively, reduce heartbeat timeout from 5 seconds to 2 seconds for faster detection.
DAG parent job fails permanently (exceeds max retries)
Impact
The parent job is dead-lettered. Its children never receive a trigger — the DAG is stuck in a partial state. Some branches may have completed (if they had no dependency on the failed parent), while the failed parent's children and their descendants are blocked indefinitely. The DAG status shows 'partial' in the dashboard.
Mitigation
Manual intervention via the CRUD API: POST /api/v1/jobs/{parent_id}/force-complete marks the failed parent as 'force-completed', satisfying its dependencies and allowing children to trigger. Alternatively, POST /api/v1/jobs/{parent_id}/skip marks the parent as skipped and triggers children with a flag indicating the parent was not actually executed. Both require operator judgment about whether skipping the parent is safe.
ResultProcessor falls behind (consumer lag grows on ResultStream)
Impact
Execution results are delayed: job statuses in JobDB and JobCache are stale. DAG children are delayed because their parent completions have not been processed. DLQ routing is delayed — failed jobs are not alerted on immediately. The delay is transparent to workers (they continue executing) but affects the control plane: dashboards show outdated status and DAG workflows slow down.
Mitigation
Auto-scale ResultProcessor workers based on Kafka consumer lag (target: <1K lag). Increase ResultStream partition count to enable more parallel consumers. Add a circuit breaker: if lag exceeds 10K messages, temporarily skip DAG dependency checks and process status updates only (catch up faster).
Time-wheel memory exhaustion (too many jobs in a single shard)
Impact
If a shard owns more than 2M jobs, the time-wheel consumes approximately 400 MB — still within the 8 GB allocation. But at 5M+ jobs per shard, memory pressure triggers GC pauses that cause the wheel to miss tick deadlines, firing jobs 100ms-1s late. At 10M+ jobs per shard, OOM kills the process.
Mitigation
Add more scheduler shards to reduce per-shard job count. With 6 shards instead of 3, each shard owns approximately 167K jobs (at 1M total). Monitor per-shard job count and trigger shard addition when any shard exceeds 500K jobs. The coordination service handles rebalancing automatically.
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Scheduler Shard (any of 3) | Shard crash, OOM, or network partition from JobDB | 1/3 of jobs stop firing. Other shards are unaffected. Recovery: 10-15 seconds (heartbeat timeout + wheel rebuild). During recovery, affected jobs fire late but are not lost. | Multi-AZ shard deployment (each shard in a different AZ). Warm standby for each shard. Heartbeat monitoring with PagerDuty alerting. Coordination service auto-reassigns orphaned partitions. |
| JobQueue (Kafka) | Kafka broker failure or network partition | Scheduler shards cannot publish due jobs. Workers have no new work. Jobs accumulate in time-wheel slots — they fire from the wheel but cannot be published. On Kafka recovery, the backlog is published in a burst. | Multi-AZ Kafka deployment (3 brokers, 3 AZs). min.insync.replicas=2 for durability. Local buffer on scheduler shards for failed publishes with retry. Monitor ISR count and broker availability. |
| ResultProcessor | All ResultProcessor workers crash | Execution results are not processed: job statuses remain stale, DAG children are not triggered, failed jobs are not dead-lettered. Workers continue executing normally — results buffer in ResultStream (Kafka retention: 7 days). On recovery, ResultProcessor processes the buffered backlog. | Auto-scaling with minimum replica count of 3. Health check probes with automatic restart. Monitor consumer lag on ResultStream — alert at >5K messages lag. |
| DLQ (SQS) | SQS service degradation or purge | Failed jobs cannot be dead-lettered. ResultProcessor retries SQS writes with exponential backoff. If SQS is down for an extended period, failed job context is lost from the pipeline — but execution records with status='failed' remain in JobDB for manual inspection. | SQS is a managed service with 99.9% SLA. Additionally, ResultProcessor writes failed execution status to JobDB before attempting DLQ write — the DB record is the fallback. AlertService can also query JobDB for failed executions as a secondary monitoring path. |
Scheduler shards scale by adding new shards and rebalancing via consistent hashing. Each new shard reduces per-shard job count and increases total scheduling throughput linearly. Workers auto-scale based on Kafka consumer lag: ShortWorker target <1K lag, LongWorker target CPU <70%. ResultProcessor auto-scales based on ResultStream consumer lag: target <1K lag. Kafka scales by adding partitions (requires consumer rebalance). JobDB scales by adding shards (requires data migration for new hash ranges). Redis scales by adding nodes to the cluster. The system is designed for 10x growth: from 1M to 10M active jobs by scaling to 30 scheduler shards and proportionally more workers.
Key metrics: (1) Per-shard time-wheel utilization — jobs loaded vs capacity. Alert if any shard exceeds 500K jobs. (2) Per-shard tick latency — time from slot fire to Kafka publish. Alert if exceeds 100ms (wheel falling behind). (3) Shard heartbeat status — alert within 5 seconds of any shard heartbeat timeout. (4) Kafka consumer lag on JobQueue — per worker pool. Alert if ShortWorker lag exceeds 5K or LongWorker lag exceeds 1K. (5) Kafka consumer lag on ResultStream — alert if ResultProcessor lag exceeds 5K. (6) DLQ depth — alert at 100 messages, critical at 500. (7) DAG completion rate — percentage of DAGs that complete all steps within SLA. Alert if drops below 95%. (8) Idempotency skip rate — percentage of worker messages skipped as duplicates. Normal: <0.1%. Alert at >1% (indicates shard failover instability). Dashboard: Grafana with per-shard panels, cross-shard aggregates, DAG dependency visualization, and DLQ trend.
At 1M active jobs: PostgreSQL db.r7g.2xlarge sharded (~$700/month), Redis cache.r7g.xlarge (~$300/month), MSK JobQueue kafka.m7g.xlarge 3 brokers (~$900/month), MSK ResultStream kafka.m7g.large 3 brokers (~$600/month), 3 Scheduler Shard pods (~$420/month), 8 SchedulerAPI pods (~$560/month), 60 ShortWorker pods (~$4,200/month), 20 LongWorker pods (~$1,400/month), 15 ResultProcessor pods (~$1,050/month), SQS DLQ (~$10/month), 2 AlertService pods (~$70/month), API Gateway (~$50/month), ALB (~$30/month). Total: ~$10,290/month. Per-job cost: $0.01/job/month. The sharded approach costs 2x the V1 variant but provides 2x throughput, shard isolation (99.95% vs 99.9%), DAG support, and DLQ observability.
API authentication via JWT tokens on every request. Multi-tenant isolation via tenant_id — queries filter by tenant_id, and shard assignment is tenant-aware to prevent cross-tenant interference. Job parameters in JSONB reference secrets via AWS Secrets Manager ARNs — plaintext secrets are rejected at the API layer. Kafka messages encrypted in transit (TLS) and at rest (KMS). DLQ messages may contain error details including stack traces — ensure DLQ access is restricted to the operations team. Worker execution sandboxed in isolated Fargate containers with no shared filesystem and no network access to other tenants' resources.
Scheduler shards deploy one at a time (canary). The shard being deployed transfers its partition to a peer shard, deploys, rebuilds its time-wheel, and reclaims the partition. Total transfer time: 10-15 seconds per shard. Zero-downtime for the overall system — at most one shard is offline at a time. SchedulerAPI deploys via rolling update (one pod at a time). Workers deploy via rolling update with Kafka consumer rebalance. ResultProcessor deploys via rolling update — consumer lag spikes briefly during rebalance but recovers within seconds. Database migrations require careful sequencing: new columns must be additive (nullable with defaults) to avoid breaking running queries during the migration window.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Single Scheduler + DB Polling) | T1 | 1-2s schedule precision, 80ms CRUD | ~500 jobs/sec dispatch | $1,000/month | Low | 99% (single instance, manual restart) |
| V1: Leader-Elected Queue (Advisory Lock + Kafka) | T2 | <1s schedule precision, 50ms CRUD | 10K jobs/sec | $3,500/month | Medium | 99.9% (leader failover <5s) |
| V2: Sharded Time-Wheel (N Shards + DAGs) | T3 | <100ms schedule precision, 50ms CRUD | 20K+ jobs/sec | $8,000/month | High | 99.95% (shard isolation, DLQ) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
The time-wheel is a circular array of 3600 slots, each representing one second (total horizon: 1 hour). A pointer advances one slot per second. When a job is created with next_run_at = 14:30:45, it is inserted into slot 45 of minute 30 of hour 14 (mapped to an index via modular arithmetic). When the pointer reaches that slot, all jobs in the slot are fired. For jobs scheduled beyond the 1-hour horizon, a hierarchical time-wheel is used: a coarse-grained wheel with 24 slots (one per hour) cascades jobs into the fine-grained wheel as they approach their fire time. Memory usage: each job slot entry is approximately 200 bytes (job_id, task, params reference), so 333K jobs per shard require approximately 64 MB — well within the 8 GB memory allocation.
The coordination service (etcd) detects the shard's heartbeat timeout within 5 seconds. It reassigns the orphaned partition (hash(job_id) mod 3 == crashed_shard_index) to one of the healthy shards. The healthy shard queries JobDB for all jobs in the reassigned partition and populates its time-wheel (5-10 seconds for 333K jobs). Total recovery time: 10-15 seconds. During this window, jobs in the crashed shard's partition may fire late. The idempotency check prevents double execution if the crashed shard had already published some events before failing.
When ResultProcessor sees a successful job completion, it queries job_dependencies for children: SELECT child_job_id FROM job_dependencies WHERE parent_job_id = ? AND status = 'pending'. For each child, it checks if all parents have completed: SELECT COUNT(*) FROM job_dependencies WHERE child_job_id = ? AND status != 'satisfied'. If the count is 0 (all parents satisfied), the child is ready to fire. ResultProcessor publishes the child's execution event to JobQueue, and a worker picks it up. This is a topological readiness check — children fire only when all parents have completed. Cycle detection is enforced at dependency creation time using a DFS-based cycle check.
A failed-jobs table requires operators to query the database to discover failures — a pull model that depends on someone remembering to check. The DLQ is a push model: AlertService monitors depth and triggers PagerDuty automatically. SQS DLQ also provides built-in features: message retention (14 days), visibility timeout (prevents concurrent processing of the same failed job), and redrive policy (replay failed jobs back to the original queue). The DLQ integrates with AWS CloudWatch for metric-based alarming without custom monitoring code.
Adding a 4th shard changes the partitioning from mod 3 to mod 4. The coordination service orchestrates the rebalance: (1) assign new shard ownership in etcd, (2) existing shards stop firing jobs in the reassigned partition ranges, (3) the new shard loads its partition from JobDB and builds its time-wheel, (4) the new shard starts firing. During rebalancing (10-20 seconds), some jobs may be orphaned (no shard owns them) or dual-owned (two shards claim them). Orphaned jobs are caught by a reconciliation scan after rebalance completes. Dual-owned jobs are prevented from double execution by the idempotency check on execution_id.
Sign in to join the discussion.
Ready to design your own Job Scheduler?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator