The simplest possible job scheduler architecture: a single SchedulerService polls PostgreSQL every second for due jobs and dispatches them directly to workers via HTTP. No leader election, no message queue, no idempotency, no dead-letter queue. Demonstrates why distributed scheduling, message queues, and idempotency keys become essential as job volume grows.
Job scheduling is a foundational system design interview question because it combines reliability engineering, distributed coordination, idempotency, and failure recovery into a single problem. Companies like Airflow, Temporal, AWS Step Functions, and Google Cloud Scheduler all ask variants of this question because it directly maps to their core product challenges. Every production system — from sending daily email digests to running nightly ETL pipelines to executing cron-based health checks — depends on a job scheduler that fires reliably, on time, and exactly once.
The naive approach uses the simplest possible architecture: a single SchedulerService backed by PostgreSQL. The scheduler runs a polling loop every 1 second, executing the query SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'pending' ORDER BY next_run_at LIMIT 1000. For each due job, the scheduler updates the row to status = 'running' and dispatches the job to a worker via synchronous HTTP POST. The worker executes the task (email send, API call, ETL step) and returns the result. The scheduler then updates the job status to 'completed' or 'failed' and, for recurring jobs, computes the next next_run_at from the cron expression.
This architecture has four critical limitations that become visible under load. First, the single scheduler instance is a single point of failure (SPOF). If the process crashes, the host goes down, or a deployment rolls out, all scheduling stops. Jobs pile up as overdue until the scheduler restarts and processes the backlog. There is no automatic failover — a human must notice and restart the service. Second, synchronous HTTP dispatch blocks the scheduler. If a worker takes 30 seconds to process a job, the scheduler's polling thread is blocked for 30 seconds. During this time, other due jobs wait. A batch of slow jobs causes a cascading delay across all scheduled work. Third, there are no idempotency keys. If a worker fails after partially completing a job (sent an email but crashed before returning HTTP 200), the scheduler retries by re-dispatching the same job. The email is sent again. For non-idempotent operations like payment charges or deployment triggers, this causes real damage. Fourth, there is no dead-letter queue. Jobs that fail after max retries (default 3) are marked as 'failed' in the database and forgotten. There is no alerting, no inspection mechanism, and no replay capability.
The polling query itself becomes a bottleneck at scale. With 10K active jobs, the time-bucket index scan is fast (~5ms). With 100K active jobs, it takes ~20ms. With 1M active jobs, the scan reads thousands of index entries and takes 50-100ms, consuming a significant portion of each 1-second polling window. Meanwhile, all CRUD operations (job creation, status queries, deletion) hit the same PostgreSQL instance, competing with the polling query for connections and I/O.
Interviewers expect candidates to identify the SPOF as the primary reliability issue, propose leader election as the solution, explain why message queues decouple scheduling from execution, and discuss idempotency keys for exactly-once execution semantics.
The naive job scheduler is a five-component architecture: Scheduler Client, Scheduler Load Balancer, Scheduler Service, Job Database (PostgreSQL), and Job Worker. There is no API gateway, no cache layer, no message queue, no leader election, and no dead-letter queue.
All external traffic arrives at the Scheduler Load Balancer (AWS ALB), which distributes requests to the single SchedulerService instance using round-robin. The LB adds approximately 1.5ms of routing latency and can handle up to 10K RPS — well above the system's actual limits, which are constrained by the single scheduler instance. The LB is never the bottleneck; the scheduler is.
The SchedulerService handles two responsibilities in a single process. The CRUD path handles job creation (POST /api/v1/jobs), status queries (GET /api/v1/jobs/{id}/runs), deletion (DELETE /api/v1/jobs/{id}), pause/resume (POST /api/v1/jobs/{id}/pause), and manual execution (POST /api/v1/jobs/execute). The scheduler path runs a background thread that polls JobDB every 1 second for due jobs, updates their status to 'running', and dispatches them to JobWorker via HTTP POST. These two responsibilities share the same thread pool (50 threads), the same database connection pool (200 connections), and the same process. If the scheduler thread gets stuck on a slow worker dispatch, CRUD requests are unaffected — but if the DB connection pool is exhausted by polling queries and worker dispatches, CRUD requests start failing.
JobDB (PostgreSQL) stores two tables: jobs (definitions with cron schedules and next_run_at) and executions (append-only history of every run). A single primary instance handles all reads and writes — no read replicas. The time-bucket index on next_run_at enables efficient polling: WHERE next_run_at <= NOW() scans only due rows. However, the index must be maintained on every INSERT and UPDATE, adding write amplification. At peak, the database handles polling queries (1/sec), CRUD writes (~800 RPS), and status reads (~2.4K RPS) simultaneously.
JobWorker is a pool of 10 Fargate tasks that receive HTTP POST requests from the scheduler. Both short tasks (email sends, 100ms) and long tasks (report generation, 30 minutes) run in the same pool with no separation. If all 10 workers are occupied by long tasks, short tasks queue behind them (head-of-line blocking). Workers have no idempotency check — if called twice with the same job parameters, the job runs twice.
The entire system operates in a single AWS region, single AZ, with no redundancy at any layer. A single host failure, AZ outage, or deployment rollout halts all scheduling.
This sequence diagram traces three primary flows: job creation, scheduler polling + dispatch, and status queries. The critical insight is the synchronous HTTP dispatch — the scheduler thread is blocked while the worker executes the job. If a worker takes 30 seconds, the scheduler thread is occupied for 30 seconds and cannot process other due jobs.
The second insight is the lack of idempotency. When the scheduler dispatches a job to the worker and the worker crashes after partial execution, the scheduler retries by creating a new execution record and re-dispatching. The job runs twice with no deduplication check.
Step-by-Step Walkthrough
Pseudocode
// SCHEDULER POLLING LOOP — runs every 1 second
async function schedulerLoop():
while true:
due_jobs = await db.query(
"SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'pending' ORDER BY next_run_at LIMIT 1000"
) // O(K) scan — K due jobs
for job in due_jobs:
await db.execute("UPDATE jobs SET status = 'running' WHERE job_id = $1", [job.job_id])
try:
// BLOCKING — scheduler thread waits for worker response
result = await http.post(worker_url + "/execute", {
task: job.task,
params: job.params
}, { timeout: 30000 })
if job.schedule is cron:
next = compute_next_cron(job.schedule)
await db.execute("UPDATE jobs SET status='pending', next_run_at=$1 WHERE job_id=$2", [next, job.job_id])
else:
await db.execute("UPDATE jobs SET status='completed' WHERE job_id=$1", [job.job_id])
catch error:
if job.retry_count < job.max_retries:
await db.execute("UPDATE jobs SET status='pending', retry_count=retry_count+1 WHERE job_id=$1", [job.job_id])
else:
await db.execute("UPDATE jobs SET status='failed' WHERE job_id=$1", [job.job_id])
// NO DLQ, NO ALERT — job silently lost
await sleep(1000) // Wait 1 second before next pollThe schema reflects the naive approach's single-database design. The jobs table carries the scheduling workload: polled every 1 second via the time-bucket index on next_run_at. The executions table is append-only history with auto-increment IDs and no idempotency key.
The critical column is jobs.next_run_at — indexed with a partial index WHERE status = 'pending' for efficient polling. This column is updated on every job completion (for recurring jobs) and every retry (reset to pending). The status column acts as a simple state machine: pending -> running -> completed/failed.
Step-by-Step Walkthrough
Pseudocode
-- JOBS TABLE: Polled every 1 second
CREATE TABLE jobs (
job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
schedule TEXT NOT NULL,
task TEXT NOT NULL,
params JSONB,
next_run_at TIMESTAMPTZ NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
retry_count INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_jobs_due ON jobs (next_run_at)
WHERE status = 'pending'; -- Partial index for polling
-- EXECUTIONS TABLE: Append-only, NO idempotency key
CREATE TABLE executions (
execution_id BIGSERIAL PRIMARY KEY, -- auto-increment, NOT idempotent
job_id UUID NOT NULL REFERENCES jobs(job_id),
status TEXT NOT NULL DEFAULT 'running',
started_at TIMESTAMPTZ DEFAULT now(),
finished_at TIMESTAMPTZ,
retry_count INTEGER NOT NULL DEFAULT 0,
error_message TEXT
);
CREATE INDEX idx_executions_job ON executions (job_id, started_at DESC);
-- THE POLLING QUERY (runs every 1 second)
SELECT * FROM jobs
WHERE next_run_at <= NOW() AND status = 'pending'
ORDER BY next_run_at
LIMIT 1000;
-- 10K active jobs: ~5ms | 100K: ~20ms | 1M: ~50-100msChoice
One SchedulerService process handles both CRUD and scheduling
Rationale
Running multiple scheduler instances without coordination causes duplicate job execution — two schedulers both poll for due jobs and dispatch the same job twice. The naive approach avoids this by running exactly one instance. The trade-off is that this single instance is a single point of failure. If it crashes, no jobs are scheduled until manual restart. Leader election (V1) solves this by allowing multiple instances with advisory lock coordination — only the lock holder runs the scheduler loop.
Choice
Scheduler dispatches jobs to workers via blocking HTTP POST
Rationale
Direct HTTP dispatch is the simplest integration pattern — no Kafka, no SQS, no additional infrastructure. The scheduler calls the worker and waits for the response. The cost is coupling: if the worker is slow (30-second report generation), the scheduler thread is blocked for 30 seconds and cannot process other due jobs. If the worker is down, the scheduler blocks until HTTP timeout (30 seconds) before marking the job as failed. Kafka (V1) eliminates this by decoupling dispatch from execution — the scheduler publishes instantly and workers consume at their own pace.
Choice
Execution records do not have unique idempotency identifiers
Rationale
The naive approach uses auto-increment execution IDs with no deduplication check. If a worker crashes after partially completing a job but before returning HTTP 200, the scheduler retries by creating a new execution record and re-dispatching. The job runs twice. For idempotent operations (cache invalidation, report regeneration) this is harmless. For non-idempotent operations (sending emails, charging payments, triggering deployments) this causes real damage. The V1 variant generates execution_id = hash(job_id + scheduled_time) and checks for existence before execution — guaranteeing at-most-once semantics.
Choice
Failed jobs are marked as 'failed' in the database with no further action
Rationale
After 3 retry attempts, permanently failed jobs have their status set to 'failed' in JobDB. There is no alerting, no DLQ for inspection, and no replay mechanism. An operator must manually query the database to discover failed jobs: SELECT * FROM jobs WHERE status = 'failed'. At low job volume this is manageable. At scale, critical job failures can go unnoticed for hours. The V2 variant routes permanently failed jobs to an SQS dead-letter queue with AlertService monitoring.
Choice
SELECT ... WHERE next_run_at <= NOW() every 1 second
Rationale
Polling is the simplest scheduling mechanism. The scheduler runs a SQL query every second to find due jobs. No additional infrastructure is needed — PostgreSQL is already the source of truth. The cost is constant DB load: the polling query runs every second regardless of whether any jobs are due. With 1M active jobs, the time-bucket scan becomes expensive (50-100ms), consuming a significant portion of each polling window. The V2 variant uses an in-memory time-wheel that fires jobs at exact second boundaries without any DB query.
Target RPS
~4K peak (800 CRUD + 2.4K status + scheduler internal)
Latency (p99)
~1-2s schedule precision, ~80ms CRUD, ~40ms status query
Storage
~50 GB/year (job definitions + execution history)
Availability
~99% (single instance, manual restart on failure)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Scheduler polling (SELECT ... WHERE next_run_at <= NOW()) | O(K) — K due jobs in the current polling window | O(K) — K job records loaded into memory | The partial index on (next_run_at) WHERE status='pending' limits the scan to due, pending jobs. At 10K active jobs with 1% due per second: K = 100 rows, ~5ms. At 1M active jobs: K = 10,000 rows, ~50-100ms. The scan time grows linearly with the number of due jobs. |
| Job dispatch (synchronous HTTP POST to worker) | O(1) per job — single HTTP round-trip | O(1) — one request/response pair | Each dispatch takes 10-5000ms depending on task complexity. The scheduler thread is blocked during dispatch. With 50 threads and 50ms average dispatch, max throughput is 1000 jobs/sec. Long tasks (30+ seconds) occupy threads and starve short tasks. |
| Job status query (SELECT ... WHERE job_id = ?) | O(log N) — PK index lookup | O(E) — E execution records for the queried job | Fast per-query (~15ms) but no cache layer — every status query hits PostgreSQL directly. At 2.4K QPS for status queries, the DB handles 2,400 reads/sec competing with polling and CRUD writes. |
Stores job definitions with cron schedules, task identifiers, and next_run_at timestamps. The scheduler polls this table every second using WHERE next_run_at <= NOW() AND status = 'pending'. Simple schema with no shard key, no estimated_duration, and no DAG support.
Indexes: PK on job_id, idx_jobs_due ON (next_run_at) WHERE status = 'pending' (time-bucket scan for polling), idx_jobs_status ON (status) (for status dashboard queries)
The polling query SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'pending' ORDER BY next_run_at LIMIT 1000 uses the idx_jobs_due partial index. At 10K active jobs, the scan takes ~5ms. At 1M active jobs, it takes 50-100ms — consuming a significant portion of each 1-second polling window.
Append-only execution history for every job run. No unique idempotency key — duplicate executions from retries create separate rows with no detection mechanism. Grows continuously with no archival strategy.
Indexes: PK on execution_id, idx_executions_job ON (job_id, started_at DESC) (per-job history lookup)
Without idempotency keys, there is no way to detect duplicate executions from retries. Two execution rows for the same job at the same scheduled time appear as separate runs. The V1 variant replaces auto-increment with execution_id = hash(job_id + scheduled_time) for deduplication.
Scheduler process crashes (OOM, host failure, deployment)
Impact
Total scheduling outage — no jobs fire until manual restart. Jobs pile up as overdue. After restart, the scheduler processes the backlog in next_run_at order, but the burst of overdue jobs may overwhelm workers and cause cascading timeouts. Jobs that were in 'running' state when the crash occurred are stuck — they never transition to 'completed' or 'failed'.
Mitigation
Add leader election via PostgreSQL advisory locks (V1). Multiple scheduler instances compete for the lock — only the holder runs the scheduler loop. If the holder crashes, another instance acquires the lock within 5 seconds and resumes scheduling. Already-enqueued jobs in Kafka continue executing during failover.
Worker pool fully occupied by long-running tasks
Impact
All 10 workers are running 30-minute report generation jobs. Short tasks (email sends, API calls) queue behind them with no priority separation. The scheduler continues polling and dispatching, but HTTP POSTs to the worker pool start timing out after 30 seconds. The scheduler's threads are blocked waiting for responses, reducing its ability to process other due jobs.
Mitigation
Separate short and long worker pools (V1). Short workers (40 pods) handle tasks under 1 minute with low latency. Long workers (10 pods) handle heavy tasks without starving short jobs. The estimated_duration field on the job definition routes to the correct pool.
Database connection pool exhaustion during peak scheduling
Impact
At peak (e.g., midnight when many cron jobs fire), the scheduler dispatches 500 jobs/sec. Each dispatch uses a DB connection to update job status. Combined with CRUD traffic (800 RPS) and status queries (2.4K RPS), the 200-connection pool saturates. New requests queue, latency spikes, and eventually the scheduler falls behind its 1-second polling cadence.
Mitigation
Add a Redis cache for status queries (V1) — reduces DB reads by 80%. Move job dispatch to Kafka (V1) — the scheduler publishes instantly without waiting for worker response, freeing connections. Increase connection pool size as a stopgap.
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| SchedulerService (single instance) | Process crash, OOM kill, or host failure | Complete scheduling outage. No jobs fire. CRUD API also goes down since it shares the same process. All traffic returns 503 until manual restart. | Leader election via advisory locks (V1) with multiple standby instances. Kubernetes liveness probe for automatic pod restart within 30 seconds. Process-level watchdog (systemd) for host-level restart. |
| JobDB (PostgreSQL) | Primary instance failure or connection exhaustion | Total system outage — scheduler cannot poll, CRUD API cannot read or write, workers cannot report results. The database is the single dependency for every component. | RDS Multi-AZ for automatic failover (30-60 seconds). PgBouncer for connection pooling. Read replicas for status queries (V1 separates read and write paths). |
| JobWorker (worker pool) | All workers crash or become unresponsive | The scheduler continues polling and dispatching, but all HTTP POSTs timeout after 30 seconds. The scheduler's thread pool fills with blocked threads. Jobs transition to 'running' but never complete. After timeout, they are marked 'failed' and retried (without idempotency). | Circuit breaker pattern: stop dispatching after 5 consecutive worker failures. Back off and retry. With Kafka (V1), worker failures do not block the scheduler — jobs remain in the queue until workers recover. |
Vertical scaling only for the scheduler instance (increase CPU/memory). Worker pool scales horizontally (add more Fargate tasks). Auto-scaling trigger for workers: queue depth > 1000 pending dispatches. The ceiling is approximately 10K active jobs regardless of worker count, because the single scheduler polling loop is the bottleneck. Beyond this ceiling, architectural changes are required: leader election + Kafka (V1) or sharded time-wheels (V2).
Key metrics to monitor: (1) Scheduler heartbeat — is the polling loop running? Alert immediately if no poll cycle completes for 5 seconds. (2) Overdue job count — SELECT COUNT(*) FROM jobs WHERE next_run_at < NOW() - INTERVAL '10 seconds' AND status = 'pending'. Alert if count exceeds 100 (scheduler is falling behind). (3) Job failure rate — percentage of executions with status='failed' in the last 5 minutes. Alert if exceeds 5%. (4) DB connection pool utilization — alert at 70% (140/200), critical at 85%. (5) Worker dispatch latency (p99) — time from scheduling to worker HTTP response. Alert if exceeds 5 seconds. (6) Polling query duration — should be under 50ms. Alert if exceeds 200ms (indicates growing job table or index degradation). Dashboard: Grafana with panels for scheduler polling cadence, overdue job count, execution success/failure ratio, DB connection pool usage, and worker dispatch latency histogram.
At 10K active jobs (single-team deployment): PostgreSQL db.r7g.large (~$200/month), ECS Fargate 1 scheduler pod (~$70/month), ECS Fargate 10 worker pods (~$700/month), ALB (~$30/month). Total: ~$1,000/month. This is the cheapest variant but breaks down beyond 10K active jobs due to the single scheduler bottleneck and lack of failover. The V1 variant at 1M active jobs costs approximately $3,500/month but provides automatic failover, exactly-once execution, and 10x the throughput — the per-job cost decreases from $0.10/job to $0.0035/job as you scale beyond the naive approach's ceiling.
Job parameter validation: task identifiers are validated against a whitelist of registered tasks — arbitrary code execution is not allowed. API authentication via JWT tokens validated on every request (~3ms overhead). Rate limiting: per-tenant request limits prevent a single tenant from overwhelming the scheduler. No encryption at rest for job parameters in PostgreSQL — sensitive parameters (API keys, credentials) should use a secrets manager reference, not inline values. Worker communication is over private VPC — no public internet exposure for the HTTP dispatch path.
Rolling deployment is not safe for the single scheduler instance — the old pod is killed before the new pod starts, creating a scheduling gap. Blue-green deployment is recommended: start the new instance, verify it is polling, then stop the old instance. Total gap: 2-5 seconds during switchover. Database migrations require a maintenance window for schema changes involving the jobs table (the polling query must not fail during migration). Workers can be deployed independently since they are stateless HTTP servers.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Single Scheduler + DB Polling) | T1 | 1-2s schedule precision, 80ms CRUD | ~500 jobs/sec dispatch | $1,000/month | Low | 99% (single instance, manual restart) |
| V1: Leader-Elected Queue (Advisory Lock + Kafka) | T2 | <1s schedule precision, 50ms CRUD | 10K jobs/sec | $3,500/month | Medium | 99.9% (leader failover <5s) |
| V2: Sharded Time-Wheel (N Shards + DAGs) | T3 | <100ms schedule precision, 50ms CRUD | 20K+ jobs/sec | $8,000/month | High | 99.95% (shard isolation, DLQ) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Job scheduling combines four hard distributed systems challenges in one problem: (1) reliability — the scheduler must fire jobs even when components fail, requiring leader election and failover, (2) exactly-once semantics — jobs with side effects must not execute twice on retry, requiring idempotency keys, (3) scalability — polling 1M jobs every second requires sharding or push-based architectures like time-wheels, and (4) workflow orchestration — real-world jobs have dependencies (DAGs) requiring topological execution order. Companies like Airflow, Temporal, AWS Step Functions, and Google Cloud Scheduler ask this question because it is their core product. Engineering teams at every company maintain internal job schedulers for ETL, notifications, and maintenance tasks.
The single scheduler has two bottlenecks: throughput and availability. Throughput: the scheduler polls the DB every second, dispatches jobs via HTTP, and waits for responses. With synchronous dispatch, a single thread can process at most 1000/dispatch_latency_ms jobs per second. If average dispatch takes 50ms, throughput is limited to 20 jobs/sec — far below the 10K/sec target. Availability: the scheduler is a SPOF. If it crashes, deployment rolls out, or the host reboots, all scheduling stops until manual restart. Even with a fast restart (30 seconds), 30 seconds of overdue jobs create a backlog that takes minutes to clear.
If the scheduler crashes after updating the job to status='running' but before the worker starts, the job is stuck in 'running' state forever — no other process will pick it up. If the scheduler crashes after the worker starts but before recording the completion, the job runs but the result is lost — the next scheduler restart will see the job as 'running' and may re-dispatch it (double execution). A heartbeat-based liveness mechanism (timeout jobs stuck in 'running' for more than N minutes) partially mitigates this, but the lack of idempotency means re-dispatch still risks double execution.
Migrate when any of these thresholds are hit: (1) you cannot tolerate manual restart on scheduler failure — leader election provides automatic failover in under 5 seconds, (2) job execution rate exceeds 500/sec — synchronous HTTP dispatch cannot keep up and Kafka decouples scheduling from execution, (3) you need exactly-once execution guarantees — idempotency keys in V1 prevent double execution on retry, (4) failed jobs must be visible — V1 provides structured error tracking that feeds into V2's dead-letter queue. Most production systems should start with V1 from day one; V0 is only appropriate for development and testing environments.
System cron has three problems for distributed applications: (1) no coordination — if you run 10 app servers with cron, each server executes the same job independently, causing 10x duplication, (2) no visibility — cron jobs have no centralized status dashboard, execution history, or alerting. Failures are logged locally and often ignored, (3) no dynamic scheduling — adding or modifying a cron job requires SSH access and config file changes, not an API call. A centralized job scheduler provides coordination (only one instance fires each job), observability (execution history, status dashboard), and an API for dynamic job management.
Sign in to join the discussion.
Ready to design your own Job Scheduler?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator