Hard · 15 components · Interview: High

Job Scheduler — Sharded Time-Wheel + DAG Workflows

Production-grade distributed job scheduler with N independent scheduler shards using in-memory time-wheels for sub-second scheduling precision. Sharded JobDB by hash(job_id), separate worker pools for short and long tasks, dead-letter queue for permanently failed jobs, DAG support for multi-step workflows, and idempotency via unique execution IDs.

Tags: Reliability, Sharding, Time-Wheel, DAG, DLQ, Job Scheduler
Problem Statement

The sharded time-wheel approach to job scheduling reflects the kind of production architecture used for large-scale scheduling, comparable in scope to services such as Google Cloud Scheduler, AWS Step Functions, and Temporal's task distribution (whose internals differ in detail). It addresses the two fundamental limitations of the leader-elected approach (V1): the single-leader polling bottleneck and the lack of DAG workflow support.

The key insight is replacing database polling with an in-memory time-wheel. A time-wheel is a circular buffer of slots, each representing one second. When a job is created, it is inserted into the slot corresponding to its next_run_at time. Each second, the wheel advances by one slot and fires all jobs in the current slot, with no database query. This eliminates the constant polling load that grows with job count. At 1M active jobs (about 333K per shard), the time-wheel fires jobs from the current slot in O(K), where K is the number of due jobs (typically 100-1000). V1's polling query is an index range scan costing O(log N + K), and it pays that cost plus a database round trip every second even when nothing is due. Because the wheel has only 3600 slots, a job due more than one hour ahead maps to a slot that comes around before its run time, so each entry must carry its next_run_at (or a remaining-rotations counter) and be skipped until the correct rotation.
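The slot mechanics described above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual implementation: the class name, the integer timestamps, and the choice to store run_at alongside each entry (so that jobs beyond the one-hour horizon wait for a later rotation) are assumptions made for the example.

```python
from collections import defaultdict


class TimeWheel:
    """Hashed time-wheel with 1-second slots (hypothetical sketch)."""

    def __init__(self, slots: int = 3600, start: int = 0):
        self.size = slots
        self.now = start                 # last second that was ticked
        self.slots = defaultdict(list)   # slot index -> [(run_at, job_id)]

    def insert(self, job_id: str, run_at: int) -> None:
        # The slot index wraps once per rotation, so keep run_at with the entry.
        self.slots[run_at % self.size].append((run_at, job_id))

    def tick(self) -> list[str]:
        """Advance one second and return the job ids due at that second."""
        self.now += 1
        idx = self.now % self.size
        entries = self.slots.pop(idx, [])
        due = [job for run_at, job in entries if run_at <= self.now]
        later = [(r, j) for r, j in entries if r > self.now]
        if later:
            self.slots[idx] = later      # entries for a later rotation stay put
        return due


wheel = TimeWheel(slots=3600, start=1_000)
wheel.insert("job-a", 1_002)
wheel.insert("job-b", 1_002 + 3600)      # same slot, next rotation
fired = [wheel.tick() for _ in range(3)]
print(fired)
```

Only job-a fires on the second tick; job-b shares the slot but stays in the wheel until the pointer comes around again an hour later.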

Sharding distributes the scheduling load across N independent scheduler instances. Each shard owns a partition of jobs determined by hash(job_id) mod N. With 3 shards, each shard manages approximately 333K jobs out of 1M total. If one shard fails, only 1/3 of jobs are affected — the other 2/3 continue scheduling without interruption. The failed shard's partition is reassigned to a healthy shard within seconds via a coordination service (etcd or ZooKeeper). This is a critical improvement over V1 where a leader failure halts all scheduling for 5-10 seconds.

DAG (Directed Acyclic Graph) support enables multi-step workflows where the completion of one job triggers the next. The job_dependencies table stores parent-child relationships. When ResultProcessor sees a parent job complete successfully, it queries for children whose parents have all completed (a topological readiness check) and triggers them by publishing to JobQueue. This readiness rule, running a task once all of its upstream tasks have succeeded, is the same basic idea used by DAG orchestrators such as Apache Airflow and Prefect. Cycle detection is enforced at job creation time: attempting to create a circular dependency returns a 400 error.
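Cycle detection at creation time can be illustrated with an iterative DFS. The sketch below assumes the dependency graph is available as an in-memory adjacency map; the function name creates_cycle and the sample job names are hypothetical, and a real implementation would walk the job_dependencies table instead.

```python
def creates_cycle(edges: dict[str, set[str]], parent: str, child: str) -> bool:
    """Return True if adding the edge parent -> child would close a cycle.

    edges maps each job id to the set of its direct children.
    """
    if parent == child:
        return True
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True          # child already reaches parent
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


dag = {"extract": {"transform"}, "transform": {"load"}}
print(creates_cycle(dag, "load", "extract"))   # closing the loop
print(creates_cycle(dag, "extract", "load"))   # a shortcut edge, still acyclic
```

An API handler would return the 400 error whenever this check reports a cycle, before inserting the new edge.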

The dead-letter queue (DLQ) captures permanently failed jobs (those that exceeded max retries) with full context: execution_id, job definition, error message, retry history. AlertService monitors DLQ depth and triggers PagerDuty when failures exceed thresholds. This replaces V0's silent failure mode where failed jobs were marked in the database and forgotten.

Worker pools are separated by estimated task duration. WorkerPoolShort (60 pods) handles tasks under 1 minute with high parallelism. WorkerPoolLong (20 pods) handles tasks over 1 minute with more CPU and memory per pod. Heartbeat-based liveness detection on long workers ensures that stuck tasks are detected and re-enqueued within 60 seconds.
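Heartbeat-based liveness detection reduces to comparing each running execution's last heartbeat against a timeout. The following sketch assumes heartbeats are tracked as a map of execution id to timestamp; the function name and the 45-second timeout are illustrative assumptions (the text only states that stuck tasks are re-enqueued within 60 seconds).

```python
def find_stuck(last_heartbeat: dict[str, float], now: float, timeout_s: float) -> list[str]:
    """Return execution ids whose last heartbeat is older than timeout_s."""
    return sorted(eid for eid, seen in last_heartbeat.items() if now - seen > timeout_s)


heartbeats = {"exec-1": 100.0, "exec-2": 155.0, "exec-3": 90.0}
stuck = find_stuck(heartbeats, now=160.0, timeout_s=45.0)
print(stuck)  # candidates for re-enqueueing
```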

The primary trade-offs are operational complexity (15 components including 3 scheduler shards, 2 Kafka clusters, Redis, PostgreSQL, ResultProcessor, DLQ, and AlertService) and time-wheel cold-start latency (5-10 seconds to rebuild the wheel from JobDB on shard restart). These are justified at scale exceeding 100K active jobs where simpler architectures hit their throughput ceiling.

Interviewers expect candidates to explain the time-wheel data structure, discuss shard ownership and failover, reason about DAG dependency resolution (topological sort), and analyze the DLQ pattern for failure observability.

Architecture Overview

The sharded time-wheel architecture uses fifteen components organized into six layers: traffic ingestion (SchedulerClient, ApiGateway, SchedulerLB), CRUD API (SchedulerAPI), scheduling (SchedulerShardA, SchedulerShardB, SchedulerShardC), data stores (JobCache/Redis, JobDB/PostgreSQL), execution (JobQueue/Kafka, WorkerPoolShort, WorkerPoolLong), and result processing (ResultStream/Kafka, ResultProcessor, DLQ/SQS, AlertService).

The CRUD path is fully separated from the scheduling path. SchedulerAPI (8 pods) handles all external requests: job creation, status queries, deletion, pause/resume, manual execution, and DAG dependency management. It writes to JobDB (sharded by hash(job_id)) and caches state in JobCache (Redis, 85% hit rate). On job creation, SchedulerAPI notifies the owning scheduler shard to load the new job into its time-wheel. This notification is a lightweight internal RPC that adds approximately 2ms to the creation path.

The scheduling path operates independently of the CRUD path. Three scheduler shards (A, B, C) each maintain an in-memory time-wheel with 3600 slots (1-second resolution, 1-hour horizon). Shard A owns jobs where hash(job_id) mod 3 == 0, Shard B owns mod 3 == 1, Shard C owns mod 3 == 2. On startup, each shard queries its partition from JobDB to populate the wheel (cold start: 5-10 seconds for 333K jobs). During steady state, no DB queries are needed — the wheel fires jobs from the current slot. Each second, the shard advances the wheel and publishes due jobs to JobQueue (Kafka) with unique execution_ids.

JobQueue (Amazon MSK, 64 partitions, 1M msg/sec capacity) distributes execution events to two worker pools. WorkerPoolShort (60 pods, 2 vCPU / 4 GB) handles tasks under 1 minute with high parallelism. WorkerPoolLong (20 pods, 4 vCPU / 8 GB) handles heavy tasks with heartbeat-based liveness detection. Both pools check execution_id idempotency before running. On completion, workers publish results to ResultStream (a separate Kafka cluster).

ResultProcessor (15 workers) consumes from ResultStream and performs three actions: (1) updates JobDB with execution status and duration, (2) for DAG jobs, checks if all parent dependencies are satisfied and triggers children by publishing to JobQueue, (3) for permanently failed jobs (retry_count >= max_retries), routes the execution record to DLQ (SQS) and triggers AlertService.

DLQ (Amazon SQS) stores permanently failed jobs with full context for manual inspection. Messages are retained for 14 days. AlertService (2 pods) polls DLQ depth every 10 seconds and triggers PagerDuty when the count exceeds 100 messages or when a scheduler shard stops heartbeating.

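Horizontal scaling: add scheduler shards to increase scheduling throughput (linear scaling). Add workers to increase execution throughput. Add Kafka partitions for higher message throughput. Shard rebalancing is managed by the coordination service. With consistent hashing, adding a 4th shard moves about 25% of each existing shard's jobs to the new one. Plain hash(job_id) mod N would instead remap roughly 75% of all jobs when N changes from 3 to 4, so ownership has to be assigned through a hash ring (or a fixed set of virtual partitions) rather than a raw modulo.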
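The difference between modulo assignment and consistent hashing when a 4th shard is added can be measured directly. The sketch below is an assumption-laden illustration: the Ring class, the 200 virtual nodes per shard, and the SHA-256-based hash are choices made for the example, not details taken from the system.

```python
import bisect
import hashlib


def h(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def mod_shard(job_id: str, n: int) -> int:
    return h(job_id) % n


class Ring:
    """Consistent-hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, shards: list[str], vnodes: int = 200):
        self.points = sorted((h(f"{s}#{v}"), s) for s in shards for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, job_id: str) -> str:
        # First ring point clockwise from the job's hash, wrapping at the end.
        i = bisect.bisect_right(self.keys, h(job_id)) % len(self.points)
        return self.points[i][1]


jobs = [f"job-{i}" for i in range(20_000)]
moved_mod = sum(mod_shard(j, 3) != mod_shard(j, 4) for j in jobs) / len(jobs)

before, after = Ring(["A", "B", "C"]), Ring(["A", "B", "C", "D"])
moved = [j for j in jobs if before.owner(j) != after.owner(j)]
moved_ring = len(moved) / len(jobs)
print(f"mod N: {moved_mod:.0%} moved, ring: {moved_ring:.0%} moved")
```

With the ring, every job that changes owner moves to the new shard D, which is what lets the three existing shards keep most of their time-wheel contents during a rebalance.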

Request Flow — Time-Wheel Fire + DAG Trigger + DLQ Routing

This sequence diagram traces four critical flows: job creation with time-wheel insertion, time-wheel tick firing, DAG child triggering via ResultProcessor, and dead-letter routing for permanently failed jobs.

The key insight is that the time-wheel replaces DB polling entirely. Once a job is loaded into the wheel (on creation or cold start), it fires from memory with zero DB I/O. The second insight is the ResultProcessor's dual role: it updates execution status and also handles DAG dependency resolution — checking if all parents are satisfied before triggering children.


Step-by-Step Walkthrough

  1. Client creates a job with optional DAG dependency (parent_job_id). SchedulerAPI writes to sharded JobDB and notifies the owning scheduler shard to insert the job into its time-wheel at the correct slot.
  2. Every second, the scheduler shard advances its time-wheel pointer and fires all jobs in the current slot. This is a pure in-memory operation with no DB query. Execution events are batch-published to Kafka with deterministic execution_ids.
  3. Workers consume from Kafka, perform the idempotency check (INSERT ... ON CONFLICT), and execute the task if it is new. On completion, they publish results to ResultStream.
  4. ResultProcessor consumes results. For successful completions: updates JobDB, marks parent dependencies as satisfied, checks if any children are ready (all parents satisfied), and triggers them via Kafka.
  5. For permanently failed jobs (retry count exceeded): updates execution status to 'dead-lettered', sends full context to the SQS DLQ, and triggers AlertService for PagerDuty notification.

Pseudocode

// TIME-WHEEL STRUCTURE
class TimeWheel:
    slots: Array[3600]  // 1 slot per second, 1-hour horizon
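    pointer: int = now().unix_seconds % 3600  // Current slot index, aligned with insert()

    function insert(job, next_run_at):
        // The slot index wraps every hour, so keep next_run_at with the entry;
        // tick() uses it to skip entries that belong to a later rotation
        slot_index = next_run_at.unix_seconds % 3600
        slots[slot_index].add({ job: job, next_run_at: next_run_at })

    function tick(now):
        pointer = (pointer + 1) % 3600
        due = slots[pointer].filter(e => e.next_run_at <= now)  // O(K) when every entry is due
        slots[pointer] = slots[pointer].filter(e => e.next_run_at > now)  // Keep later rotations

        events = []
        for entry in due:
            job = entry.job
            // Deterministic ID from the scheduled time (not the wall clock), so a
            // duplicate fire during shard failover yields the same execution_id
            exec_id = hash(job.job_id + entry.next_run_at)
            events.push({ execution_id: exec_id, job_id: job.job_id, task: job.task })

            // Reinsert recurring jobs into a future slot
            if job.schedule.startsWith("cron:"):
                next = compute_next_cron(job.schedule, after = entry.next_run_at)
                insert(job, next)

        kafka.publishBatch("job-execution", events)  // ~4ms for 1000 events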

// DAG DEPENDENCY RESOLUTION (in ResultProcessor)
async function processResult(result):
    await db.execute("UPDATE job_executions SET status=$1 WHERE execution_id=$2",
        [result.status, result.execution_id])

    if result.status == "completed":
        // Mark parent dependency as satisfied
        await db.execute(
            "UPDATE job_dependencies SET status='satisfied' WHERE parent_job_id=$1",
            [result.job_id])

        // Find children whose ALL parents are now satisfied.
        // The aliases matter: without them the inner child_job_id comparison
        // refers to the inner table on both sides and is always true
        ready_children = await db.query(
            "SELECT d1.child_job_id FROM job_dependencies d1 " +
            "WHERE d1.parent_job_id = $1 " +
            "AND NOT EXISTS (SELECT 1 FROM job_dependencies d2 " +
            "WHERE d2.child_job_id = d1.child_job_id AND d2.status != 'satisfied')",
            [result.job_id])

        for child_id in ready_children:
            child_job = await db.query("SELECT * FROM jobs WHERE job_id = $1", [child_id])
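            // Derived from the parent execution rather than the wall clock, so a
            // redelivered result produces the same child execution_id
            exec_id = hash(child_id + result.execution_id)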
            await kafka.publish("job-execution", {
                execution_id: exec_id, job_id: child_id,
                task: child_job.task, params: child_job.params
            })

    else if result.status == "failed" and result.retry_count >= MAX_RETRIES:
        await db.execute("UPDATE job_executions SET status='dead-lettered' WHERE execution_id=$1",
            [result.execution_id])
        await sqs.sendMessage(DLQ_URL, { execution_id: result.execution_id, error: result.error_message })
        await alertService.trigger("job_permanently_failed", result)
Database Schema (ER Diagram)

The V2 schema extends V1 in two ways: it adds a job_dependencies table for DAG workflows, and it expands job_executions with dead-letter status tracking. The jobs table gains shard_id for partition ownership and estimated_duration for worker pool routing. The three tables together support the full lifecycle: job definition (jobs), execution tracking with idempotency (job_executions), and DAG dependency resolution (job_dependencies).

The jobs table is sharded by hash(job_id) across 3 partitions matching the scheduler shards. The shard_id column determines which scheduler shard owns the job and loads it into its time-wheel. The job_executions table uses the same deterministic execution_id as V1 for idempotency, with the addition of dead-lettered status. The job_dependencies table stores directed edges in the DAG with a unique constraint preventing duplicate edges.


Step-by-Step Walkthrough

  1. The jobs table is sharded by hash(job_id). The shard_id column maps to the owning scheduler shard. On cold start, each shard queries SELECT * FROM jobs WHERE shard_id = ? to populate its time-wheel.
  2. The job_executions table uses deterministic execution_id for idempotency. Workers INSERT ... ON CONFLICT (execution_id) DO NOTHING to prevent double execution. The status column now includes 'dead-lettered' for jobs routed to the DLQ.
  3. The job_dependencies table stores DAG edges: parent_job_id -> child_job_id. The unique constraint (parent_job_id, child_job_id) prevents duplicate edges. Cycle detection runs at INSERT time via DFS.
  4. ResultProcessor queries job_dependencies on parent completion: first UPDATE status to 'satisfied', then check if all parents of each child are satisfied. Ready children are triggered via JobQueue.
  5. Partitioning strategy: jobs by hash(job_id) across 3 shards, job_executions co-located with jobs by job_id hash, job_dependencies not sharded (small table, typically <100K rows).

Pseudocode

-- JOBS TABLE: Sharded by hash(job_id), time-wheel source
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    schedule TEXT NOT NULL,
    task TEXT NOT NULL,
    params JSONB,
    next_run_at TIMESTAMPTZ NOT NULL,
    status TEXT NOT NULL DEFAULT 'active',
    estimated_duration TEXT NOT NULL DEFAULT 'short',
    shard_id INTEGER NOT NULL,  -- hash(job_id) % 3
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_jobs_shard ON jobs (shard_id, next_run_at)
    WHERE status = 'active';  -- Cold-start wheel rebuild
CREATE INDEX idx_jobs_tenant ON jobs (tenant_id, status);  -- Per-tenant job listing

-- JOB_EXECUTIONS TABLE: Idempotent + dead-letter tracking
CREATE TABLE job_executions (
    execution_id UUID PRIMARY KEY,  -- hash(job_id + scheduled_time)
    job_id UUID NOT NULL REFERENCES jobs(job_id),
    status TEXT NOT NULL DEFAULT 'pending',
    started_at TIMESTAMPTZ,
    finished_at TIMESTAMPTZ,
    duration_ms INTEGER,
    retry_count INTEGER NOT NULL DEFAULT 0,
    error_message TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_executions_job ON job_executions (job_id, created_at DESC);
CREATE INDEX idx_executions_failed ON job_executions (status)
    WHERE status IN ('failed', 'dead-lettered');

-- JOB_DEPENDENCIES TABLE: DAG edges
CREATE TABLE job_dependencies (
    dependency_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parent_job_id UUID NOT NULL REFERENCES jobs(job_id),
    child_job_id UUID NOT NULL REFERENCES jobs(job_id),
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT now(),
    UNIQUE (parent_job_id, child_job_id)
);
CREATE INDEX idx_deps_parent ON job_dependencies (parent_job_id, status);
CREATE INDEX idx_deps_child ON job_dependencies (child_job_id, status);

-- DAG TRIGGER QUERY (runs in ResultProcessor on parent completion)
UPDATE job_dependencies SET status = 'satisfied'
    WHERE parent_job_id = $1;
SELECT child_job_id FROM job_dependencies
    WHERE parent_job_id = $1
    AND NOT EXISTS (
        SELECT 1 FROM job_dependencies d2
        WHERE d2.child_job_id = job_dependencies.child_job_id
        AND d2.status != 'satisfied'
    );  -- Returns children where ALL parents are now satisfied
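
The trigger query above can be exercised end to end against an in-memory SQLite database. This is a sketch under stated assumptions: SQLite stands in for PostgreSQL, the table is reduced to the columns the query touches, and the function name on_parent_completed and the sample job names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_dependencies (
    parent_job_id TEXT NOT NULL,
    child_job_id  TEXT NOT NULL,
    status        TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (parent_job_id, child_job_id))""")
conn.executemany(
    "INSERT INTO job_dependencies (parent_job_id, child_job_id) VALUES (?, ?)",
    [("extract_a", "merge"), ("extract_b", "merge"), ("extract_a", "report")])


def on_parent_completed(parent_id: str) -> list[str]:
    """Mark the parent's edges satisfied and return children that are now ready."""
    with conn:  # one transaction for the UPDATE and the readiness SELECT
        conn.execute("UPDATE job_dependencies SET status = 'satisfied' "
                     "WHERE parent_job_id = ?", (parent_id,))
        rows = conn.execute(
            "SELECT d1.child_job_id FROM job_dependencies d1 "
            "WHERE d1.parent_job_id = ? AND NOT EXISTS ("
            "  SELECT 1 FROM job_dependencies d2 "
            "  WHERE d2.child_job_id = d1.child_job_id "
            "  AND d2.status != 'satisfied') "
            "ORDER BY d1.child_job_id", (parent_id,)).fetchall()
    return [r[0] for r in rows]


first = on_parent_completed("extract_a")   # merge still waits on extract_b
second = on_parent_completed("extract_b")
print(first, second)
```

The merge job has two parents, so it only becomes ready after the second completion, while report is released as soon as its single parent finishes.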
Key Design Decisions
Time-Wheel Instead of DB Polling

Choice

In-memory circular buffer with 3600 slots (1-second resolution) instead of SQL polling

Rationale

The time-wheel eliminates the constant DB polling overhead. In V1, the leader runs SELECT ... WHERE next_run_at <= NOW() every second, a query that scans the time-bucket index even when no jobs are due. With 1M active jobs, this scan takes 50-100ms per cycle. The time-wheel fires in O(K), where K is the number of due jobs in the current slot, with zero DB I/O. The trade-off is memory (roughly 64 MB of wheel entries per shard for 333K jobs, at about 200 bytes per entry) and cold-start latency (5-10 seconds to rebuild from DB on restart).

Sharded Schedulers for Horizontal Scaling

Choice

N independent scheduler shards, each owning hash(job_id) mod N partition

Rationale

A single leader (V1) bottlenecks at 10K jobs/sec. With 3 shards, each handles approximately 7K jobs/sec, roughly doubling total throughput to 20K+ jobs/sec. More importantly, shard isolation limits the blast radius of failures: if Shard A crashes, only 1/3 of jobs are delayed while Shards B and C continue normally. In V1, a leader crash halts all scheduling. Shard ownership is managed by etcd/ZooKeeper: on shard failure, the orphaned partition is reassigned to a healthy shard within seconds.

DAG Support via ResultProcessor

Choice

job_dependencies table with topological readiness checks on parent completion

Rationale

Many real-world workflows require job-to-job dependencies: ETL pipeline stages (extract -> transform -> load), CI/CD (build -> test -> deploy), data pipelines (ingest -> validate -> aggregate). Without DAG support (V0 and V1), these workflows require manual orchestration or external tools like Airflow. The ResultProcessor handles dependency resolution: when a parent completes, it queries SELECT child_job_id FROM job_dependencies WHERE parent_job_id = ? AND status = 'satisfied', checks if all parents of each child are complete, and triggers ready children.

Dead-Letter Queue for Failed Job Observability

Choice

SQS DLQ with AlertService monitoring instead of silent database-only failure

Rationale

In V0, failed jobs are marked in the database and forgotten. In V1, they are logged but not surfaced. The DLQ pattern ensures permanently failed jobs are captured with full context (execution_id, error message, retry history, job definition) for manual inspection and replay. AlertService polls DLQ depth every 10 seconds and triggers PagerDuty when thresholds are exceeded. This is the standard failure observability pattern used by AWS Lambda DLQ, SQS DLQ, and Kafka error topics.

Separate ResultStream for Execution Results

Choice

Dedicated Kafka cluster for job results instead of reusing JobQueue

Rationale

Separating the result stream from the job queue prevents result processing from competing with job dispatch for Kafka throughput. JobQueue handles 20K+ msg/sec of execution events. ResultStream handles 20K+ msg/sec of completion results. Using a single Kafka cluster would require 40K+ msg/sec capacity and careful topic/partition management to prevent interference. Separate clusters also enable independent scaling and failure isolation.

Idempotency via Deterministic Execution ID

Choice

execution_id = hash(job_id + scheduled_time) with INSERT ... ON CONFLICT

Rationale

Shard failover can cause the same job to be fired by two shards briefly (old shard fires before failover, new shard fires after). Kafka redelivery can cause the same execution event to be consumed twice. The deterministic execution_id ensures that for a given job at a given scheduled time, only one execution can proceed. Workers check via INSERT ... ON CONFLICT (execution_id) DO NOTHING — an atomic, lock-free deduplication mechanism.
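The deduplication step can be sketched with a deterministic UUID and an ON CONFLICT insert. The example uses SQLite in place of PostgreSQL, and the namespace constant, the uuid5-based derivation, and the function names execution_id and claim are assumptions for illustration; the text specifies only that the ID is a hash of job_id and scheduled_time.

```python
import sqlite3
import uuid

# Hypothetical fixed namespace so every shard derives the same IDs.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def execution_id(job_id: str, scheduled_time: str) -> str:
    """Same job and scheduled time always map to the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{job_id}|{scheduled_time}"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_executions (execution_id TEXT PRIMARY KEY, job_id TEXT NOT NULL)")


def claim(job_id: str, scheduled_time: str) -> bool:
    """Return True only for the first caller that claims this execution."""
    with conn:
        cur = conn.execute(
            "INSERT INTO job_executions (execution_id, job_id) VALUES (?, ?) "
            "ON CONFLICT (execution_id) DO NOTHING",
            (execution_id(job_id, scheduled_time), job_id))
    return cur.rowcount == 1


results = [claim("job-42", "2024-01-01T00:00:00Z"),   # first delivery
           claim("job-42", "2024-01-01T00:00:00Z"),   # duplicate from failover or redelivery
           claim("job-42", "2024-01-01T00:00:01Z")]   # next scheduled run
print(results)
```

A worker would run the task only when claim returns True and silently acknowledge the Kafka message otherwise.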

Scale & Performance

Target RPS

~40K peak (6K CRUD + 22K status + 12K execution/DAG)

Latency (p99)

<100ms schedule precision, 50ms CRUD p99, 8ms status (cache hit)

Storage

~500 GB/year (sharded DB + 2 Kafka clusters + DLQ + execution history)

Availability

99.95% (shard isolation, DLQ, multi-AZ, heartbeat monitoring)

Time & Space Complexity
Operation: Time-wheel tick (fire due jobs from current slot)
Time: O(K), where K is the number of jobs in the current slot
Space: O(J/S), where J is total jobs and S is shard count, stored in the wheel
Notes: Each second, the wheel advances and fires all jobs in the current slot. Typical K = 100-1000 jobs per slot per shard. No DB query; this is a pure in-memory operation. The wheel consumes approximately 64 MB per 333K jobs (200 bytes per job slot entry).

Operation: Cold-start wheel rebuild (SELECT ... WHERE shard_id = ?)
Time: O(J/S), a scan of all jobs in the shard's partition
Space: O(J/S), all jobs loaded into memory
Notes: On shard restart, the full partition must be loaded: SELECT * FROM jobs WHERE shard_id = ? AND status = 'active'. For 333K jobs, this takes 5-10 seconds. During this window, jobs may fire late. A warm standby shard can reduce this gap.

Operation: DAG readiness check (all parents completed?)
Time: O(P), where P is the number of parents of the child job
Space: O(1), constant per check
Notes: For each completed parent, ResultProcessor queries: SELECT COUNT(*) FROM job_dependencies WHERE child_job_id = ? AND status != 'satisfied'. If count == 0, all parents are done and the child is triggered. Typical P = 1-5 parents per job in production DAGs.

Operation: Cycle detection (DFS on dependency graph)
Time: O(V + E), for V jobs and E dependencies in the connected component
Space: O(V) for the visited set
Notes: Runs at INSERT time for new dependencies. Starting from child_job_id, DFS follows existing edges to check if it can reach parent_job_id. In practice, DAGs are small (10-50 jobs) so the DFS completes in microseconds.
Database Schema (HLD)
jobs

Stores job definitions sharded by hash(job_id). Each scheduler shard queries its partition on cold start to populate the time-wheel. Includes estimated_duration for worker pool routing and shard_id for partition ownership.

job_id UUID PK DEFAULT gen_random_uuid() (shard key; hash determines owning shard)
schedule TEXT NOT NULL (cron expression or ISO timestamp)
task TEXT NOT NULL (task identifier)
params JSONB (task parameters)
next_run_at TIMESTAMPTZ NOT NULL
status TEXT NOT NULL DEFAULT 'active' (active|paused|deleted)
estimated_duration TEXT NOT NULL DEFAULT 'short' (short|long; routes to worker pool)
shard_id INTEGER NOT NULL (owning scheduler shard: 0, 1, or 2)
tenant_id UUID NOT NULL (multi-tenant isolation)
created_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on job_id, idx_jobs_shard ON (shard_id, next_run_at) WHERE status = 'active' (cold-start wheel rebuild), idx_jobs_tenant ON (tenant_id, status) (per-tenant job listing)

The shard_id column determines which scheduler shard owns the job. On cold start, the shard queries SELECT * FROM jobs WHERE shard_id = ? AND status = 'active' to populate its time-wheel. During steady state, no DB queries are needed — the wheel fires from memory. shard_id is computed as hash(job_id) mod N at job creation time.

job_executions

Append-only execution history with unique execution_id for idempotency. Includes retry_count and error_message for failure tracking. Rows with status='dead-lettered' have been routed to the DLQ.

execution_id UUID PK (hash of job_id + scheduled_time; idempotency key)
job_id UUID FK REFERENCES jobs(job_id)
status TEXT NOT NULL (pending|running|completed|failed|dead-lettered)
started_at TIMESTAMPTZ
finished_at TIMESTAMPTZ
duration_ms INTEGER
retry_count INTEGER NOT NULL DEFAULT 0
error_message TEXT (error details for failed/dead-lettered executions)
created_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on execution_id (idempotency check via INSERT ... ON CONFLICT), idx_executions_job ON (job_id, created_at DESC) (per-job history), idx_executions_failed ON (status) WHERE status IN ('failed','dead-lettered') (failure dashboard)

The execution_id is deterministic: hash(job_id + scheduled_time). This enables INSERT ... ON CONFLICT (execution_id) DO NOTHING for atomic deduplication. The partial index on failed/dead-lettered status supports the failure dashboard without scanning the entire table.

job_dependencies

DAG dependency graph for workflow orchestration. Each row represents a directed edge: parent_job_id must complete before child_job_id can fire. Cycle detection is enforced at write time. ResultProcessor queries this table on parent completion to find ready children.

dependency_id UUID PK DEFAULT gen_random_uuid()
parent_job_id UUID FK REFERENCES jobs(job_id)
child_job_id UUID FK REFERENCES jobs(job_id)
status TEXT NOT NULL DEFAULT 'pending' (pending|satisfied|failed)
created_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on dependency_id, idx_deps_parent ON (parent_job_id, status) (find children to trigger on parent completion), idx_deps_child ON (child_job_id, status) (check if all parents are satisfied for a child), UNIQUE (parent_job_id, child_job_id) (prevent duplicate edges)

Cycle detection runs at INSERT time: a DFS from child_job_id through existing edges checks if it can reach parent_job_id. If so, the INSERT is rejected with a 400 error. On parent completion, ResultProcessor updates status to 'satisfied' and checks if all of the child's parents are satisfied — if so, the child is triggered.

Event Contracts
job-execution (JobQueue)

Published by scheduler shards when a job's time-wheel slot fires. Consumed by WorkerPoolShort and WorkerPoolLong based on estimated_duration filtering. Also published by ResultProcessor when triggering DAG children.

Key Schema

job_id (string) — partition key for per-job ordering

Value Schema

{ execution_id: string (idempotency key), job_id: string, task: string, params: object, estimated_duration: string (short|long), shard_id: integer, scheduled_time: string (ISO timestamp) }
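
The value schema above maps naturally onto a typed record, with estimated_duration selecting the consuming pool. In this sketch the dataclass name, the target_pool helper, and the sample values are hypothetical; only the field names and the short/long split come from the contract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JobExecutionEvent:
    execution_id: str            # idempotency key
    job_id: str                  # also the Kafka partition key
    task: str
    estimated_duration: str      # "short" or "long"
    shard_id: int
    scheduled_time: str          # ISO timestamp
    params: dict = field(default_factory=dict)


def target_pool(event: JobExecutionEvent) -> str:
    """Pick the worker pool that should run this event."""
    if event.estimated_duration == "short":
        return "WorkerPoolShort"
    if event.estimated_duration == "long":
        return "WorkerPoolLong"
    raise ValueError(f"unknown estimated_duration: {event.estimated_duration!r}")


event = JobExecutionEvent(
    execution_id="e-1", job_id="job-42", task="send_report",
    estimated_duration="long", shard_id=0, scheduled_time="2024-01-01T00:00:00Z")
print(target_pool(event))
```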

job-result (ResultStream)

Published by workers on task completion (success or failure). Consumed by ResultProcessor for status updates, DAG triggering, and DLQ routing.

Key Schema

job_id (string) — partition key for per-job ordering

Value Schema

{ execution_id: string, job_id: string, status: string (completed|failed), duration_ms: integer, retry_count: integer, error_message: string, output: string (truncated to 10KB) }

What-If Scenarios

Scheduler Shard A crashes (host failure, OOM, network partition)

Impact

Jobs in Shard A's partition (hash(job_id) mod 3 == 0, approximately 333K jobs) stop firing. Shards B and C continue scheduling 2/3 of all jobs without interruption. The coordination service detects the heartbeat timeout within 5 seconds and reassigns the partition. The receiving shard rebuilds its time-wheel for the new partition (5-10 seconds). Total recovery: 10-15 seconds. During this window, approximately 3K jobs (10 seconds x 333 due/sec) may fire 10-15 seconds late.
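
The recovery arithmetic in this scenario can be written out explicitly. The helper names below are hypothetical, and the 333 due jobs per second is the figure used in the paragraph above rather than a measured value.

```python
def recovery_window_s(detect_s: int, rebuild_s: int) -> int:
    """Heartbeat detection time plus time-wheel rebuild time."""
    return detect_s + rebuild_s


def late_jobs(due_per_sec: int, window_s: int) -> int:
    """Jobs that come due while the partition has no active owner."""
    return due_per_sec * window_s


best = recovery_window_s(detect_s=5, rebuild_s=5)
worst = recovery_window_s(detect_s=5, rebuild_s=10)
late = late_jobs(due_per_sec=333, window_s=best)
print(best, worst, late)
```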

Mitigation

Warm standby shard: maintain a standby shard that mirrors the primary's time-wheel but does not fire. On primary failure, the standby activates within 1-2 seconds. Alternatively, reduce heartbeat timeout from 5 seconds to 2 seconds for faster detection.

DAG parent job fails permanently (exceeds max retries)

Impact

The parent job is dead-lettered. Its children never receive a trigger — the DAG is stuck in a partial state. Some branches may have completed (if they had no dependency on the failed parent), while the failed parent's children and their descendants are blocked indefinitely. The DAG status shows 'partial' in the dashboard.

Mitigation

Manual intervention via the CRUD API: POST /api/v1/jobs/{parent_id}/force-complete marks the failed parent as 'force-completed', satisfying its dependencies and allowing children to trigger. Alternatively, POST /api/v1/jobs/{parent_id}/skip marks the parent as skipped and triggers children with a flag indicating the parent was not actually executed. Both require operator judgment about whether skipping the parent is safe.

ResultProcessor falls behind (consumer lag grows on ResultStream)

Impact

Execution results are delayed: job statuses in JobDB and JobCache are stale. DAG children are delayed because their parent completions have not been processed. DLQ routing is delayed — failed jobs are not alerted on immediately. The delay is transparent to workers (they continue executing) but affects the control plane: dashboards show outdated status and DAG workflows slow down.

Mitigation

Auto-scale ResultProcessor workers based on Kafka consumer lag (target: <1K lag). Increase ResultStream partition count to enable more parallel consumers. Add a circuit breaker: if lag exceeds 10K messages, temporarily skip DAG dependency checks and process status updates only (catch up faster).

Time-wheel memory exhaustion (too many jobs in a single shard)

Impact

If a shard owns more than 2M jobs, the time-wheel consumes approximately 400 MB — still within the 8 GB allocation. But at 5M+ jobs per shard, memory pressure triggers GC pauses that cause the wheel to miss tick deadlines, firing jobs 100ms-1s late. At 10M+ jobs per shard, OOM kills the process.
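
The memory figures follow from a per-entry size estimate. The sketch below assumes the 200 bytes per entry quoted in the complexity notes and uses decimal megabytes, so its output is only roughly in line with the rounded figures in the text; the function name is hypothetical.

```python
def wheel_memory_mb(jobs: int, bytes_per_entry: int = 200) -> float:
    """Approximate wheel footprint in decimal megabytes."""
    return jobs * bytes_per_entry / 1_000_000


for jobs in (333_000, 2_000_000, 5_000_000):
    print(f"{jobs:>9,} jobs -> {wheel_memory_mb(jobs):,.1f} MB")
```

Entry size alone does not explain the GC pressure described above; job parameters and runtime object overhead would add to these numbers.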

Mitigation

Add more scheduler shards to reduce per-shard job count. With 6 shards instead of 3, each shard owns approximately 167K jobs (at 1M total). Monitor per-shard job count and trigger shard addition when any shard exceeds 500K jobs. The coordination service handles rebalancing automatically.

Failure Modes & Resilience
Component: Scheduler Shard (any of 3)
Failure: Shard crash, OOM, or network partition from JobDB
Impact: 1/3 of jobs stop firing. Other shards are unaffected. Recovery: 10-15 seconds (heartbeat timeout + wheel rebuild). During recovery, affected jobs fire late but are not lost.
Mitigation: Multi-AZ shard deployment (each shard in a different AZ). Warm standby for each shard. Heartbeat monitoring with PagerDuty alerting. Coordination service auto-reassigns orphaned partitions.

Component: JobQueue (Kafka)
Failure: Kafka broker failure or network partition
Impact: Scheduler shards cannot publish due jobs. Workers have no new work. Jobs accumulate in time-wheel slots: they fire from the wheel but cannot be published. On Kafka recovery, the backlog is published in a burst.
Mitigation: Multi-AZ Kafka deployment (3 brokers, 3 AZs). min.insync.replicas=2 for durability. Local buffer on scheduler shards for failed publishes with retry. Monitor ISR count and broker availability.

Component: ResultProcessor
Failure: All ResultProcessor workers crash
Impact: Execution results are not processed: job statuses remain stale, DAG children are not triggered, failed jobs are not dead-lettered. Workers continue executing normally, and results buffer in ResultStream (Kafka retention: 7 days). On recovery, ResultProcessor processes the buffered backlog.
Mitigation: Auto-scaling with minimum replica count of 3. Health check probes with automatic restart. Monitor consumer lag on ResultStream and alert at >5K messages lag.

Component: DLQ (SQS)
Failure: SQS service degradation or purge
Impact: Failed jobs cannot be dead-lettered. ResultProcessor retries SQS writes with exponential backoff. If SQS is down for an extended period, failed job context is lost from the pipeline, but execution records with status='failed' remain in JobDB for manual inspection.
Mitigation: SQS is a managed service with 99.9% SLA. Additionally, ResultProcessor writes failed execution status to JobDB before attempting DLQ write, so the DB record is the fallback. AlertService can also query JobDB for failed executions as a secondary monitoring path.
Scaling Strategy

Scheduler shards scale by adding new shards and rebalancing via consistent hashing. Each new shard reduces per-shard job count and increases total scheduling throughput linearly. Workers auto-scale based on Kafka consumer lag: ShortWorker target <1K lag, LongWorker target CPU <70%. ResultProcessor auto-scales based on ResultStream consumer lag: target <1K lag. Kafka scales by adding partitions (requires consumer rebalance). JobDB scales by adding shards (requires data migration for new hash ranges). Redis scales by adding nodes to the cluster. The system is designed for 10x growth: from 1M to 10M active jobs by scaling to 30 scheduler shards and proportionally more workers.

Monitoring & Alerting

Key metrics:

  1. Per-shard time-wheel utilization: jobs loaded vs capacity. Alert if any shard exceeds 500K jobs.
  2. Per-shard tick latency: time from slot fire to Kafka publish. Alert if it exceeds 100ms (wheel falling behind).
  3. Shard heartbeat status: alert within 5 seconds of any shard heartbeat timeout.
  4. Kafka consumer lag on JobQueue, per worker pool. Alert if ShortWorker lag exceeds 5K or LongWorker lag exceeds 1K.
  5. Kafka consumer lag on ResultStream: alert if ResultProcessor lag exceeds 5K.
  6. DLQ depth: alert at 100 messages, critical at 500.
  7. DAG completion rate: percentage of DAGs that complete all steps within SLA. Alert if it drops below 95%.
  8. Idempotency skip rate: percentage of worker messages skipped as duplicates. Normal: <0.1%. Alert at >1% (indicates shard failover instability).

Dashboard: Grafana with per-shard panels, cross-shard aggregates, DAG dependency visualization, and DLQ trend.
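
The upper-bound alert thresholds listed above can be captured in a small rule table. The metric key names and the breached helper below are hypothetical, and the sketch covers only the "alert if above" rules; the DAG completion rate is a lower bound and would need its own comparison.

```python
THRESHOLDS = {
    "shard_job_count": 500_000,
    "tick_latency_ms": 100,
    "short_worker_lag": 5_000,
    "long_worker_lag": 1_000,
    "result_processor_lag": 5_000,
    "dlq_depth": 100,
    "idempotency_skip_rate_pct": 1.0,
}


def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics whose current value exceeds its threshold."""
    return sorted(name for name, limit in THRESHOLDS.items()
                  if metrics.get(name, 0) > limit)


sample = {"dlq_depth": 120, "tick_latency_ms": 40, "long_worker_lag": 1_500}
alerts = breached(sample)
print(alerts)
```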

Cost Analysis

At 1M active jobs: PostgreSQL db.r7g.2xlarge sharded (~$700/month), Redis cache.r7g.xlarge (~$300/month), MSK JobQueue kafka.m7g.xlarge 3 brokers (~$900/month), MSK ResultStream kafka.m7g.large 3 brokers (~$600/month), 3 Scheduler Shard pods (~$420/month), 8 SchedulerAPI pods (~$560/month), 60 ShortWorker pods (~$4,200/month), 20 LongWorker pods (~$1,400/month), 15 ResultProcessor pods (~$1,050/month), SQS DLQ (~$10/month), 2 AlertService pods (~$70/month), API Gateway (~$50/month), ALB (~$30/month). Total: ~$10,290/month. Per-job cost: ~$0.01/job/month. The sharded approach costs roughly 2-3x the V1 variant but provides 2x throughput, higher availability through shard isolation (99.95% vs 99.9%), DAG support, and DLQ observability.
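The total and per-job figures follow directly from the line items. This short check reproduces the arithmetic; the dictionary keys are just labels chosen for the example.

```python
# Monthly line items in USD, copied from the cost breakdown above.
MONTHLY_USD = {
    "PostgreSQL (sharded)": 700,
    "Redis": 300,
    "MSK JobQueue": 900,
    "MSK ResultStream": 600,
    "Scheduler shards (3)": 420,
    "SchedulerAPI (8)": 560,
    "ShortWorker (60)": 4_200,
    "LongWorker (20)": 1_400,
    "ResultProcessor (15)": 1_050,
    "SQS DLQ": 10,
    "AlertService (2)": 70,
    "API Gateway": 50,
    "ALB": 30,
}
ACTIVE_JOBS = 1_000_000

total = sum(MONTHLY_USD.values())
per_job = total / ACTIVE_JOBS
worker_share = (MONTHLY_USD["ShortWorker (60)"] + MONTHLY_USD["LongWorker (20)"]) / total

print(total)                      # 10290
print(round(per_job, 4))          # 0.0103
print(round(worker_share * 100))  # 54 (percent of spend on worker pods)
```

Worker pods account for a little over half of the bill, so worker right-sizing is the first place to look when optimizing cost.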

Security Considerations

API authentication via JWT tokens on every request. Multi-tenant isolation via tenant_id — queries filter by tenant_id, and shard assignment is tenant-aware to prevent cross-tenant interference. Job parameters in JSONB reference secrets via AWS Secrets Manager ARNs — plaintext secrets are rejected at the API layer. Kafka messages encrypted in transit (TLS) and at rest (KMS). DLQ messages may contain error details including stack traces — ensure DLQ access is restricted to the operations team. Worker execution sandboxed in isolated Fargate containers with no shared filesystem and no network access to other tenants' resources.

Deployment Strategy

Scheduler shards deploy one at a time (canary). The shard being deployed transfers its partition to a peer shard, deploys, rebuilds its time-wheel, and reclaims the partition. Total transfer time: 10-15 seconds per shard. Zero-downtime for the overall system — at most one shard is offline at a time. SchedulerAPI deploys via rolling update (one pod at a time). Workers deploy via rolling update with Kafka consumer rebalance. ResultProcessor deploys via rolling update — consumer lag spikes briefly during rebalance but recovers within seconds. Database migrations require careful sequencing: new columns must be additive (nullable with defaults) to avoid breaking running queries during the migration window.

Real-World Examples
  • Google Cloud Scheduler uses a sharded architecture with independent scheduling cells — each cell owns a partition of cron jobs and fires independently
  • Temporal's internal task queue distribution uses consistent hashing to assign workflow tasks to shard-owning history service instances, with automatic failover via membership protocol
  • Apache Airflow 2.0 supports running multiple schedulers concurrently; they coordinate through row-level database locks (SELECT ... FOR UPDATE SKIP LOCKED) rather than static partitions, a contrasting way to remove the single-scheduler bottleneck
  • Uber's Cadence (predecessor to Temporal) uses sharded history service instances with time-based task queues for scheduled workflow execution
Solution Comparison
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Single Scheduler + DB Polling) | T1 | 1-2s schedule precision, 80ms CRUD | ~500 jobs/sec dispatch | $1,000/month | Low | 99% (single instance, manual restart) |
| V1: Leader-Elected Queue (Advisory Lock + Kafka) | T2 | <1s schedule precision, 50ms CRUD | 10K jobs/sec | $3,500/month | Medium | 99.9% (leader failover <5s) |
| V2: Sharded Time-Wheel (N Shards + DAGs) | T3 | <100ms schedule precision, 50ms CRUD | 20K+ jobs/sec | $8,000/month | High | 99.95% (shard isolation, DLQ) |

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does the time-wheel data structure work?

The time-wheel is a circular array of 3600 slots, each representing one second (total horizon: 1 hour). A pointer advances one slot per second. When a job is created with next_run_at = 14:30:45, it is inserted into slot 1845 (30 × 60 + 45 seconds past the hour; in general, the index is next_run_at in seconds mod 3600). When the pointer reaches that slot, all jobs in the slot are fired. For jobs scheduled beyond the 1-hour horizon, a hierarchical time-wheel is used: a coarse-grained wheel with 24 slots (one per hour) cascades jobs into the fine-grained wheel as they approach their fire time. Memory usage: each job slot entry is approximately 200 bytes (job_id, task, params reference), so 333K jobs per shard require approximately 67 MB, well within the 8 GB memory allocation.
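The two-level structure described here can be sketched in a few dozen lines. This is an illustrative model, not the production implementation: the class and method names are hypothetical, time is an integer number of seconds, and jobs beyond the 24-hour coarse horizon are simply rejected on the assumption that they stay in JobDB until closer to their fire time.

```python
class HierarchicalTimeWheel:
    FINE = 3600    # one-second slots, 1-hour horizon
    COARSE = 24    # one-hour slots, 24-hour horizon

    def __init__(self, now: int) -> None:
        self.now = now  # last second that has been processed
        self.fine = [[] for _ in range(self.FINE)]
        self.coarse = [[] for _ in range(self.COARSE)]

    def schedule(self, job_id: str, run_at: int) -> None:
        delay = run_at - self.now
        if delay <= 0:
            raise ValueError("run_at must be in the future")
        if delay < self.FINE:
            # Within the hour: slot index is seconds mod 3600.
            self.fine[run_at % self.FINE].append(job_id)
        elif delay < self.FINE * self.COARSE:
            # Further out: park in the hour bucket, keep run_at for cascading.
            self.coarse[(run_at // self.FINE) % self.COARSE].append((job_id, run_at))
        else:
            raise ValueError("beyond the 24-hour horizon")

    def tick(self) -> list[str]:
        """Advance one second and return the job IDs that are due."""
        self.now += 1
        if self.now % self.FINE == 0:
            # Top of the hour: cascade this hour's bucket into the fine wheel.
            slot = (self.now // self.FINE) % self.COARSE
            for job_id, run_at in self.coarse[slot]:
                self.fine[run_at % self.FINE].append(job_id)
            self.coarse[slot] = []
        idx = self.now % self.FINE
        due, self.fine[idx] = self.fine[idx], []
        return due


wheel = HierarchicalTimeWheel(now=14 * 3600 + 30 * 60)   # 14:30:00
wheel.schedule("report", 14 * 3600 + 30 * 60 + 45)       # 14:30:45, fine slot 1845
wheel.schedule("cleanup", 16 * 3600 + 30 * 60 + 45)      # 16:30:45, coarse slot 16
for _ in range(45):
    due = wheel.tick()
print(due)  # ['report']
```

Firing costs only the length of the current slot's list, which is the O(K) behaviour the design relies on; no index or database is consulted on the hot path.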

What happens when a scheduler shard crashes?

The coordination service (etcd) detects the shard's heartbeat timeout within 5 seconds. It reassigns the orphaned partition (hash(job_id) mod 3 == crashed_shard_index) to one of the healthy shards. The healthy shard queries JobDB for all jobs in the reassigned partition and populates its time-wheel (5-10 seconds for 333K jobs). Total recovery time: 10-15 seconds. During this window, jobs in the crashed shard's partition may fire late. The idempotency check prevents double execution if the crashed shard had already published some events before failing.
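The idempotency check only works across a failover if the replacement shard produces the same execution ID the crashed shard would have. One way to get that property, shown here as an assumption rather than the documented scheme, is to derive the ID deterministically from the job and its scheduled fire time. The in-memory set stands in for whatever shared store the workers actually use.

```python
import hashlib


def execution_id(job_id: str, scheduled_at: int) -> str:
    """Deterministic ID: the same job and fire time always hash to the same value."""
    return hashlib.sha256(f"{job_id}:{scheduled_at}".encode()).hexdigest()[:32]


class Worker:
    def __init__(self) -> None:
        # Stand-in for a shared store with an atomic "insert if absent" operation.
        self.seen: set[str] = set()

    def handle(self, job_id: str, scheduled_at: int) -> bool:
        """Return True if the job ran, False if it was skipped as a duplicate."""
        eid = execution_id(job_id, scheduled_at)
        if eid in self.seen:
            return False
        self.seen.add(eid)
        # ... run the task here ...
        return True


worker = Worker()
print(worker.handle("job-42", 1_700_000_000))  # True  (published by the crashed shard)
print(worker.handle("job-42", 1_700_000_000))  # False (re-published after failover)
print(worker.handle("job-42", 1_700_000_060))  # True  (the next scheduled run)
```

With multiple workers the check-and-insert must be a single atomic operation in the shared store; the two-step version above is safe only because the example is single-threaded.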

How does DAG dependency resolution work?

When ResultProcessor sees a successful job completion, it queries job_dependencies for children: SELECT child_job_id FROM job_dependencies WHERE parent_job_id = ? AND status = 'pending'. It marks those edges as 'satisfied', then for each child checks whether any parent is still outstanding: SELECT COUNT(*) FROM job_dependencies WHERE child_job_id = ? AND status != 'satisfied'. If the count is 0 (all parents satisfied), the child is ready to fire. ResultProcessor publishes the child's execution event to JobQueue, and a worker picks it up. This is a topological readiness check: children fire only when all parents have completed. Cycle detection is enforced at dependency creation time using a DFS-based cycle check.
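The readiness check and the cycle check can be exercised against an in-memory SQLite database. This is a sketch under assumptions: the three-column job_dependencies schema and both function names are hypothetical, and the UPDATE that flips an edge to 'satisfied' is written out explicitly because the count query can only reach zero once completed edges are marked.

```python
import sqlite3


def on_job_success(conn: sqlite3.Connection, parent: str) -> list[str]:
    """Mark the parent's outgoing edges satisfied and return children ready to fire."""
    children = [r[0] for r in conn.execute(
        "SELECT child_job_id FROM job_dependencies "
        "WHERE parent_job_id = ? AND status = 'pending'", (parent,))]
    conn.execute(
        "UPDATE job_dependencies SET status = 'satisfied' "
        "WHERE parent_job_id = ? AND status = 'pending'", (parent,))
    ready = []
    for child in children:
        (remaining,) = conn.execute(
            "SELECT COUNT(*) FROM job_dependencies "
            "WHERE child_job_id = ? AND status != 'satisfied'", (child,)).fetchone()
        if remaining == 0:
            ready.append(child)
    conn.commit()
    return ready


def would_create_cycle(conn: sqlite3.Connection, parent: str, child: str) -> bool:
    """Adding parent -> child creates a cycle iff parent is reachable from child."""
    stack, seen = [child], set()
    while stack:  # iterative DFS over existing edges
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(r[0] for r in conn.execute(
            "SELECT child_job_id FROM job_dependencies WHERE parent_job_id = ?", (node,)))
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_dependencies (parent_job_id TEXT, child_job_id TEXT, status TEXT)")
conn.executemany("INSERT INTO job_dependencies VALUES (?, ?, 'pending')",
                 [("A", "C"), ("B", "C")])  # C waits on both A and B

print(on_job_success(conn, "A"))          # []
print(on_job_success(conn, "B"))          # ['C']
print(would_create_cycle(conn, "C", "A"))  # True
```

In a multi-instance ResultProcessor the update and the count should run in one transaction with appropriate locking; otherwise two parents finishing at the same moment could both publish the child, leaving the idempotency check as the only guard.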

Why a separate DLQ instead of a failed-jobs table?

A failed-jobs table requires operators to query the database to discover failures: a pull model that depends on someone remembering to check. The DLQ is a push model: AlertService monitors depth and triggers PagerDuty automatically. SQS also provides built-in features: message retention (up to 14 days), visibility timeout (prevents concurrent processing of the same failed job), and dead-letter queue redrive (replay failed jobs back to the source queue). The DLQ integrates with AWS CloudWatch for metric-based alarming without custom monitoring code.

How do you handle shard rebalancing when adding a new scheduler shard?

Adding a 4th shard changes the partitioning from mod 3 to mod 4. With plain modulo partitioning this remaps about three quarters of all jobs, including moves between existing shards, so the existing shards must also reload the jobs they gain (a consistent-hash ring would limit movement to roughly one quarter). The coordination service orchestrates the rebalance: (1) assign new shard ownership in etcd, (2) existing shards stop firing jobs in the reassigned partition ranges, (3) each shard loads the jobs newly assigned to it from JobDB and builds or updates its time-wheel, (4) the new shard starts firing. During rebalancing (10-20 seconds), some jobs may be orphaned (no shard owns them) or dual-owned (two shards claim them). Orphaned jobs are caught by a reconciliation scan after rebalance completes. Dual-owned jobs are prevented from double execution by the idempotency check on execution_id.
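How many jobs change owner when the modulus changes from 3 to 4 can be measured directly. The sketch below assumes a stable hash (SHA-256 here, since Python's built-in hash() is randomized per process for strings); the function names and the synthetic job IDs are illustrative.

```python
import hashlib


def shard_for(job_id: str, num_shards: int) -> int:
    """Stable hash(job_id) mod N; identical across processes and restarts."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(job_ids: list[str], old_n: int, new_n: int) -> float:
    """Fraction of jobs whose owning shard changes between old_n and new_n shards."""
    moved = sum(1 for j in job_ids if shard_for(j, old_n) != shard_for(j, new_n))
    return moved / len(job_ids)


jobs = [f"job-{i}" for i in range(100_000)]
print(f"{moved_fraction(jobs, 3, 4):.1%} of jobs change owner")
```

The result comes out close to 75%: a job keeps its owner only when its hash gives the same remainder under both moduli, which holds for 3 of every 12 consecutive hash values. That is the volume the reconciliation scan and the idempotency check have to absorb during the rebalance window.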
