Vetora logo
Medium8 componentsInterview: High

Job Scheduler — Leader-Elected Queue (Advisory Lock + Kafka)

Industry-standard job scheduler using PostgreSQL advisory locks for leader election and Kafka for reliable job delivery. The leader polls JobDB every second for due jobs and pushes execution events to Kafka. Two worker pools (short and long) prevent head-of-line blocking. Idempotency keys prevent double execution.

ReliabilityKafkaLeader ElectionJob Scheduler
Problem Statement

The leader-elected queue approach to job scheduling represents the industry-standard architecture used by production scheduling systems like Airflow's CeleryExecutor, Sidekiq Enterprise, and Quartz Scheduler. It solves the three fundamental problems with the naive architecture: the single point of failure, the lack of execution buffering, and the absence of idempotency guarantees.

The key insight is separating the scheduling decision from the execution. The scheduler determines which jobs are due and publishes execution events to Kafka. Workers consume events at their own pace. This decoupling means the scheduler never blocks on slow workers — it publishes instantly (5ms) and moves on. If workers are down, events buffer in Kafka (up to 500K messages) until workers recover. This is the same producer-consumer pattern used by every modern message-driven architecture.

Leader election via PostgreSQL advisory locks solves the single point of failure. Multiple JobService instances compete for an advisory lock (pg_try_advisory_lock(42)). Only the lock holder runs the scheduler loop. Other instances serve CRUD and status queries. If the leader crashes, the advisory lock is automatically released, and another instance acquires it within 5 seconds. This provides automatic failover without external coordination services like ZooKeeper or etcd.

Idempotency keys (execution_id = hash(job_id + scheduled_time)) prevent double execution. Before running a job, each worker checks whether an execution record with that execution_id already exists in JobDB. If it does, the duplicate is skipped. This guarantees at-most-once execution even when Kafka redelivers messages (consumer rebalance, network timeout) or when the leader publishes duplicates during failover.

Two worker pools prevent head-of-line blocking. ShortJobWorker (40 pods) handles tasks under 1 minute: email sends, API webhooks, cache invalidations. LongJobWorker (10 pods) handles tasks over 1 minute: report generation, ETL, database backups. The estimated duration is determined at job creation time and encoded in the Kafka message. This is the same tier-based worker pattern used by Sidekiq (queues) and Celery (task routing).

Redis (JobCache) caches recent job execution state for fast status queries. The cache has an 80% hit rate for dashboard queries, reducing JobDB load by 5x on the read path. The cache TTL (60s) ensures reasonably fresh data.

The primary trade-offs are the single-leader bottleneck (one instance polls 1M jobs every second) and the leader failover gap (5-10 seconds during which no new jobs are scheduled). At scale exceeding 10K jobs/sec or requiring sub-second failover, the sharded time-wheel architecture (V2) distributes the scheduling load across N independent shards.

Interviewers expect candidates to explain why leader election is needed, how advisory locks work, why Kafka decouples scheduling from execution, and how idempotency keys prevent double execution on retry.

Architecture Overview

The leader-elected queue architecture uses eight main components organized into four layers: traffic ingestion (SchedulerClient, ApiGateway, MainLB), application service (JobService with leader election), data stores (JobCache/Redis, JobDB/PostgreSQL), and async execution (JobQueue/Kafka, ShortJobWorker, LongJobWorker).

The CRUD path handles external traffic. SchedulerClient sends requests to ApiGateway (authentication, rate limiting at 25K RPS), which routes to MainLB (ALB, round-robin, 40K RPS capacity). MainLB distributes to 6 JobService pods. All pods serve CRUD endpoints: job creation, status queries, deletion, pause/resume, manual execution. This path has a p99 latency of approximately 50ms.

The scheduling path runs exclusively on the leader pod. One of the 6 JobService pods acquires a PostgreSQL advisory lock (pg_try_advisory_lock(42)) and becomes the leader. The leader runs the scheduler loop: every 1 second, it polls JobDB for due jobs (SELECT * FROM jobs WHERE next_run_at <= NOW()), creates execution records with idempotency keys, updates next_run_at for recurring jobs, and publishes execution events to JobQueue (Kafka). The leader check is lightweight — a single SQL call per poll cycle. If the leader crashes, the advisory lock is released automatically (connection termination), and another pod acquires it within 5 seconds.

JobQueue (Amazon MSK / Kafka) provides reliable execution buffering. The scheduler publishes due jobs as events partitioned by job_id (ensuring per-job ordering). Two consumer groups consume from the same topic: ShortJobWorker filters for estimated_duration < 1 min, LongJobWorker filters for >= 1 min. Kafka provides at-least-once delivery — if a worker crashes mid-execution, the message is redelivered to another worker. The idempotency check prevents the redelivered message from causing double execution.

ShortJobWorker (40 pods, 2 vCPU / 4 GB) handles quick tasks with high parallelism. LongJobWorker (10 pods, 4 vCPU / 8 GB) handles heavy tasks with more resources per pod. Each worker: (1) reads execution_id from the Kafka message, (2) checks JobDB for existing execution with that ID (idempotency), (3) if no duplicate, executes the task, (4) writes result to JobDB, (5) updates JobCache.

JobCache (Redis, 13 GB) caches recent execution state. Status queries check Redis first (80% hit rate), falling through to JobDB for historical data. The cache also stores the leader lock state for fast leader-check without a DB round-trip.

Horizontal scaling is independent per layer. JobService pods scale for CRUD throughput (more pods = more CRUD capacity, but only one leader). Workers scale for execution throughput (more pods = more parallel executions). Kafka scales via partition count (32 partitions). The leader is the bottleneck — a single pod polls 1M jobs every second.

Architecture Preview
Loading architecture preview...
Request Flow — Leader Polling + Kafka Dispatch + Idempotency Check

This sequence diagram traces the full scheduling lifecycle: job creation, leader polling, Kafka dispatch, worker execution with idempotency check, and status query with cache. The critical insight is the idempotency check — before executing any job, the worker attempts INSERT ... ON CONFLICT (execution_id) DO NOTHING. If the insert succeeds, the job runs. If it conflicts, the job is a duplicate and is skipped.

The second insight is the leader election. Only one JobService pod runs the scheduler loop. All other pods serve CRUD and status queries. If the leader dies, the advisory lock is released and another pod acquires it within 5 seconds.

Loading diagram...

Step-by-Step Walkthrough

  1. 1Client creates a job via API Gateway. JobService writes to JobDB and caches state in Redis. The job is visible to the leader's next poll cycle
  2. 2Every 1 second, the leader checks its advisory lock, polls for due jobs, creates idempotent execution records, and batch-publishes to Kafka
  3. 3Workers consume from Kafka. Before execution, they attempt an INSERT ... ON CONFLICT on the execution_id. If the insert conflicts (duplicate), the job is skipped. If it succeeds, the job runs
  4. 4On completion, workers update the execution status in JobDB and refresh the cache in Redis. For recurring jobs, the leader computed the next next_run_at during the poll cycle
  5. 5Status queries check Redis first (80% hit rate). Cache misses fall through to JobDB. The cache TTL (60s) ensures reasonably fresh data for dashboards

Pseudocode

// LEADER SCHEDULING LOOP (runs on the advisory lock holder)
async function leaderSchedulerLoop():
    while true:
        is_leader = await db.query("SELECT pg_try_advisory_lock(42)")
        if not is_leader:
            await sleep(5000)  // Retry lock acquisition every 5s
            continue

        due_jobs = await db.query(
            "SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'active' ORDER BY next_run_at LIMIT 1000"
        )

        execution_events = []
        for job in due_jobs:
            exec_id = hash(job.job_id + job.next_run_at)
            await db.execute(
                "INSERT INTO executions (execution_id, job_id, status) VALUES ($1, $2, 'pending')",
                [exec_id, job.job_id]
            )
            execution_events.push({ execution_id: exec_id, job_id: job.job_id, task: job.task, params: job.params })

        await kafka.publishBatch("job-execution", execution_events)  // ~5ms for 1000 messages

        // Batch update next_run_at for recurring jobs
        for job in due_jobs.filter(j => j.schedule.startsWith("cron:")):
            next = compute_next_cron(job.schedule)
            await db.execute("UPDATE jobs SET next_run_at = $1 WHERE job_id = $2", [next, job.job_id])

        await sleep(1000)

// WORKER EXECUTION (with idempotency check)
async function workerConsume(message):
    // Idempotency check — atomic INSERT or skip
    result = await db.execute(
        "INSERT INTO executions (execution_id, job_id, status) VALUES ($1, $2, 'running') ON CONFLICT (execution_id) DO NOTHING RETURNING execution_id",
        [message.execution_id, message.job_id]
    )
    if result.rowCount == 0:
        return  // Duplicate — already executed, skip

    try:
        output = await executeTask(message.task, message.params)
        await db.execute("UPDATE executions SET status='success', finished_at=NOW() WHERE execution_id=$1", [message.execution_id])
        await redis.set("job:" + message.job_id + ":state", { status: "success", last_run_at: new Date() }, { ex: 60 })
    catch error:
        await db.execute("UPDATE executions SET status='failed', error_message=$1 WHERE execution_id=$2", [error.message, message.execution_id])
Database Schema (ER Diagram)

The V1 schema introduces two critical improvements over V0: the idempotent execution_id on the executions table (renamed to job_executions) and multi-tenant isolation via tenant_id on the jobs table. The execution_id is a deterministic hash of job_id + scheduled_time, enabling INSERT ... ON CONFLICT for atomic deduplication.

The jobs table adds tenant_id for multi-tenant isolation. The partial index WHERE status = 'active' ensures the leader's polling query scans only active jobs. The job_executions table replaces auto-increment with UUID execution_id, and the unique constraint enables the idempotency check.

Loading diagram...

Step-by-Step Walkthrough

  1. 1The jobs table stores definitions with cron schedules, next_run_at, and tenant_id. The partial index WHERE status = 'active' enables efficient leader polling — only active jobs are scanned
  2. 2The job_executions table uses deterministic execution_id = hash(job_id + scheduled_time). The unique constraint on execution_id enables INSERT ... ON CONFLICT DO NOTHING for atomic deduplication
  3. 3Multi-tenant isolation: the tenant_id column on jobs scopes all queries. The idx_jobs_tenant index enables per-tenant job listing without scanning the entire table
  4. 4The leader polls with SELECT ... WHERE next_run_at <= NOW() AND status = 'active'. For each due job, it creates an execution record with the deterministic execution_id, then publishes to Kafka

Pseudocode

-- JOBS TABLE: Polled by leader, multi-tenant
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    schedule TEXT NOT NULL,
    task TEXT NOT NULL,
    params JSONB,
    next_run_at TIMESTAMPTZ NOT NULL,
    status TEXT NOT NULL DEFAULT 'active',
    tenant_id UUID NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_jobs_due ON jobs (next_run_at)
    WHERE status = 'active';
CREATE INDEX idx_jobs_tenant ON jobs (tenant_id, status);

-- JOB_EXECUTIONS TABLE: Idempotent execution records
CREATE TABLE job_executions (
    execution_id UUID PRIMARY KEY,  -- hash(job_id + scheduled_time)
    job_id UUID NOT NULL REFERENCES jobs(job_id),
    status TEXT NOT NULL DEFAULT 'pending',
    started_at TIMESTAMPTZ,
    finished_at TIMESTAMPTZ,
    duration_ms INTEGER,
    output TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_executions_job ON job_executions (job_id, created_at DESC);

-- IDEMPOTENCY CHECK (atomic insert-or-skip)
INSERT INTO job_executions (execution_id, job_id, status)
VALUES ('a1b2c3...', 'job-123', 'running')
ON CONFLICT (execution_id) DO NOTHING
RETURNING execution_id;
-- Returns 1 row if new (proceed), 0 rows if duplicate (skip)
Key Design Decisions
Leader Election via PostgreSQL Advisory Locks

Choice

pg_try_advisory_lock(42) for leader selection instead of ZooKeeper or etcd

Rationale

PostgreSQL advisory locks are the simplest leader election mechanism when PostgreSQL is already in the architecture. No additional infrastructure (ZooKeeper cluster, etcd cluster) is needed. The lock is automatically released when the holding connection terminates (crash, network partition). Another instance acquires the lock within 5 seconds. The trade-off is coupling to PostgreSQL — if the DB is down, leader election fails. At this architecture's scale, PostgreSQL is already a critical dependency, so this coupling is acceptable.

Kafka for Job Execution Queue

Choice

Amazon MSK (Kafka) instead of direct HTTP dispatch or SQS

Rationale

Kafka provides three critical capabilities: (1) decoupling — the scheduler publishes instantly without waiting for worker response, (2) buffering — if workers are slow or down, up to 500K messages buffer in Kafka until workers recover, (3) ordering — partitioning by job_id guarantees per-job event ordering, preventing out-of-order execution. SQS provides similar buffering but lacks ordering guarantees. Direct HTTP dispatch (V0) blocks the scheduler on every worker call.

Separate Short and Long Worker Pools

Choice

Two consumer groups filtering by estimated_duration

Rationale

A 30-minute report generation job sitting in the same queue as a 100ms email send causes head-of-line blocking. With separate pools, short workers (40 pods) maintain low latency for quick tasks while long workers (10 pods) handle heavy tasks without interference. This is the tier-based worker pattern used by Sidekiq queues and Celery task routing.

Idempotency Keys for Exactly-Once Execution

Choice

execution_id = hash(job_id + scheduled_time) checked before each execution

Rationale

Kafka provides at-least-once delivery — messages can be redelivered on consumer rebalance or network timeout. Without idempotency, a redelivered message causes double execution. The execution_id is deterministic: for a given job at a given scheduled time, only one execution can exist. Workers check for this ID in JobDB before running — if it exists, the duplicate is skipped. This converts at-least-once delivery to effectively-once execution.

Redis Cache for Status Queries

Choice

JobCache (Redis) with 60s TTL for recent execution state

Rationale

Status queries (GET /api/v1/jobs/{id}/runs) are the highest-volume traffic (60% of requests). Dashboards and monitoring poll every few seconds. Caching recent execution state in Redis (80% hit rate) reduces JobDB read load by 5x. The 60s TTL ensures reasonably fresh data — acceptable for status dashboards. The cache also stores leader lock state for fast leader-check without a DB round-trip.

DB Polling Instead of Push-Based Scheduling

Choice

SELECT ... WHERE next_run_at <= NOW() every 1 second

Rationale

Kafka does not natively support delayed delivery (SQS max delay is 15 minutes). Polling the DB every second gives sub-second scheduling precision for all time horizons — from 1 second to 1 year in the future. The polling query uses a time-bucket index (WHERE next_run_at <= now) which scans only due rows. At 1M active jobs, the scan takes approximately 50ms per cycle. The V2 variant eliminates polling entirely with an in-memory time-wheel.

Scale & Performance

Target RPS

~20K peak (4K CRUD + 12K status + 4K execution triggers)

Latency (p99)

<1s schedule precision, 50ms CRUD p99, 10ms status (cache hit)

Storage

~200 GB/year (job definitions + execution history + Kafka retention)

Availability

99.9% (leader failover <5s, Kafka buffering)

Time & Space Complexity
OperationTimeSpaceNotes
Leader polling (SELECT ... WHERE next_run_at <= NOW())O(K) — K due jobs in the current windowO(K) — K job records in memoryThe partial index limits the scan to active, due jobs. At 1M active jobs with 1000 due per second: K = 1000, ~10ms. The scan time is proportional to due jobs, not total jobs.
Kafka publish (batch of due jobs)O(K) — K messages published per poll cycleO(1) — producer buffers are boundedKafka batch publish of 1000 messages takes ~5ms. The scheduler publishes all due jobs in a single batch, then updates their next_run_at in a single SQL batch UPDATE. Total poll-to-publish time: ~15ms per cycle.
Idempotency check (INSERT ... ON CONFLICT)O(1) — single PK lookup on execution_idO(1) — constant per checkThe idempotency check is a single INSERT with ON CONFLICT DO NOTHING. If the execution_id already exists, the insert is a no-op (~1ms). If it does not exist, the insert proceeds (~5ms). No lock contention — the unique constraint handles concurrency.
Database Schema (HLD)
jobs

Stores job definitions with cron schedules, task identifiers, and next_run_at timestamps. The leader scheduler polls this table every second using the partial index on next_run_at WHERE status = 'active'. Advisory locks for leader election use pg_try_advisory_lock on a well-known lock ID.

job_id UUID PK DEFAULT gen_random_uuid()schedule TEXT NOT NULL (cron expression or ISO timestamp)task TEXT NOT NULL (task identifier)params JSONB (task parameters)next_run_at TIMESTAMPTZ NOT NULL (indexed for polling)status TEXT NOT NULL DEFAULT 'active' (active|paused|deleted)tenant_id UUID NOT NULL (multi-tenant isolation)created_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on job_id, idx_jobs_due ON (next_run_at) WHERE status = 'active' (time-bucket scan for leader polling), idx_jobs_tenant ON (tenant_id, status) (per-tenant job listing)

The leader polls using SELECT * FROM jobs WHERE next_run_at <= NOW() AND status = 'active' ORDER BY next_run_at LIMIT 1000. The partial index scans only active jobs. At 1M active jobs with 0.1% due per second: ~1000 rows returned per poll, ~10ms scan time.

job_executions

Append-only execution history with unique execution_id for idempotency. execution_id = hash(job_id + scheduled_time) ensures exactly one execution per job per scheduled time. Workers check this table before execution to prevent duplicates.

execution_id UUID PK (hash of job_id + scheduled_time — idempotency key)job_id UUID FK REFERENCES jobs(job_id)status TEXT NOT NULL (pending|running|success|failed)started_at TIMESTAMPTZfinished_at TIMESTAMPTZduration_ms INTEGERoutput TEXT (stdout/stderr from execution)created_at TIMESTAMPTZ DEFAULT now()

Indexes: PK on execution_id (also serves as idempotency check via INSERT ... ON CONFLICT), idx_executions_job ON (job_id, created_at DESC) (per-job history lookup)

The idempotency check is: INSERT INTO job_executions (execution_id, ...) ON CONFLICT (execution_id) DO NOTHING. If the insert succeeds, proceed with execution. If it conflicts (duplicate), skip. This is an atomic, lock-free deduplication mechanism.

Event Contracts
job-executionjob-execution

Published by the leader scheduler when a job is due. Consumed by ShortJobWorker and LongJobWorker based on estimated_duration filtering.

Key Schema

job_id (string) — partition key for per-job ordering

Value Schema

{ execution_id: string (idempotency key), job_id: string, task: string, params: object, estimated_duration: string (short|long), scheduled_time: string (ISO timestamp) }

What-If Scenarios

Leader pod crashes during a poll cycle

Impact

The leader has identified 500 due jobs, published 300 to Kafka, and updated next_run_at for those 300 in JobDB. The remaining 200 were not published. When the new leader acquires the advisory lock (5-10 seconds), it polls JobDB and finds those 200 still due. It publishes them to Kafka — but the 300 already published may also be re-published if the next_run_at update was not committed. The idempotency check prevents double execution: workers check execution_id before running.

Mitigation

Ensure the Kafka publish and the JobDB next_run_at update are in the same transactional scope. Use Kafka transactions or the outbox pattern: write execution records to JobDB in the same transaction as next_run_at updates, then relay them to Kafka asynchronously.

Kafka consumer rebalance causes message redelivery

Impact

When a worker pod is added, removed, or crashes, Kafka triggers a consumer rebalance. Messages that were consumed but not committed are redelivered to other consumers. Without idempotency, these jobs would execute twice. With the idempotency check (execution_id exists in JobDB), the duplicate messages are detected and skipped — no double execution.

Mitigation

The idempotency check is the primary mitigation. Additionally, workers should commit offsets only after the execution result is written to JobDB (at-least-once delivery with application-level deduplication). This pattern is standard in Kafka consumer best practices.

Burst of cron jobs at midnight (e.g., 50K daily jobs fire simultaneously)

Impact

At midnight, 50K daily cron jobs become due simultaneously. The leader polls and finds all 50K jobs. Publishing 50K messages to Kafka takes ~250ms (batch publish). Updating 50K next_run_at values in JobDB takes ~500ms (batch UPDATE). The total poll cycle takes ~750ms — fitting within the 1-second window but leaving only 250ms of headroom. Workers receive 50K messages simultaneously, consuming them over the next 5-10 seconds across 40 short + 10 long worker pods.

Mitigation

Rate-limit the publish batch size to 5K per poll cycle. Spread the remaining 45K across subsequent cycles (9 more seconds to clear the backlog). Alternatively, use the V2 sharded approach where 3 shards each handle 17K jobs — 3x the per-shard throughput headroom.

Failure Modes & Resilience
ComponentFailureImpactMitigation
Leader JobService (scheduler loop)Leader pod crash or network partition from PostgreSQLScheduling stops for 5-10 seconds during leader failover. CRUD API continues serving from non-leader pods. Workers continue processing already-enqueued Kafka messages. After failover, the new leader processes the overdue backlog.Advisory lock timeout tuning: reduce the lock check interval from 5s to 2s. Warm standby: have a designated successor pod attempt lock acquisition every 1 second. Monitor leader heartbeat and alert immediately on loss.
Kafka (JobQueue)Broker failure or network partitionThe scheduler cannot publish due jobs. Jobs are identified as due but cannot be enqueued. Workers have no new work to consume (but finish existing work). Jobs become overdue until Kafka recovers. After recovery, the scheduler publishes the overdue backlog.Multi-AZ Kafka deployment (3 brokers across 3 AZs). Local buffer on the scheduler for failed publishes with exponential backoff retry. Monitor Kafka broker health and ISR (in-sync replicas) count.
Redis (JobCache)Cache eviction storm or Redis node failureStatus queries fall back to JobDB, increasing database read load by 5x. The scheduler and workers are unaffected — they do not depend on Redis for execution. CRUD API latency increases from 10ms (cache hit) to 40ms (DB read). Dashboard refresh rate may need to be reduced.Redis Cluster with automatic failover. Set maxmemory-policy to allkeys-lru. Implement graceful degradation: if Redis is down, serve stale data from a local in-process cache (30-second TTL) before falling through to DB.
Scaling Strategy

JobService pods scale horizontally for CRUD throughput (6 -> 12 -> 20 pods). Only one pod is the leader — adding pods increases CRUD capacity but not scheduling throughput. Workers scale independently: ShortJobWorker auto-scales based on Kafka consumer lag (target: <1K lag), LongJobWorker auto-scales based on CPU utilization (target: <70%). Kafka scales via partition count increase (32 -> 64 -> 128). The ceiling is approximately 10K jobs/sec at the leader scheduler — beyond this, the V2 sharded approach distributes polling across N shards.

Monitoring & Alerting

Key metrics: (1) Leader health — is the advisory lock held? Alert within 10 seconds of lock release. (2) Poll cycle duration — time from SELECT to Kafka publish completion. Alert if exceeds 800ms (approaching the 1-second window). (3) Overdue job count — jobs where next_run_at < NOW() - 10s. Alert if count exceeds 100. (4) Kafka consumer lag — messages waiting to be consumed by workers. Alert if lag exceeds 10K messages (workers are falling behind). (5) Idempotency skip rate — percentage of Kafka messages skipped due to duplicate execution_id. Normal: <0.1%. Alert if exceeds 1% (indicates duplicate publishing). (6) Worker execution latency (p99) — by short vs long pool. Alert if short pool p99 exceeds 30s. (7) Cache hit rate — should be ~80%. Alert if drops below 60% (eviction pressure or working set shift).

Cost Analysis

At 1M active jobs: PostgreSQL db.r7g.xlarge (~$350/month), Redis cache.r7g.large (~$150/month), MSK kafka.m7g.xlarge 3 brokers (~$900/month), ECS Fargate 6 JobService pods (~$420/month), 40 ShortJobWorker pods (~$2,800/month), 10 LongJobWorker pods (~$700/month), API Gateway (~$50/month), ALB (~$30/month). Total: ~$5,400/month. Per-job cost: $0.0054/job/month. The leader-elected approach is cost-effective up to 10K jobs/sec. Beyond that, the V2 sharded approach ($8,000/month) provides 2x throughput with shard isolation and DLQ support.

Security Considerations

API authentication via JWT tokens on every request (~3ms overhead). Multi-tenant isolation: jobs are scoped by tenant_id — queries filter by tenant_id to prevent cross-tenant data access. Job parameters stored as JSONB in PostgreSQL — sensitive values (API keys, credentials) should reference a secrets manager (AWS Secrets Manager) rather than containing inline values. Kafka messages encrypted in transit (TLS) and at rest (KMS). Worker execution sandboxed — tasks run in isolated Fargate containers with no shared filesystem. Advisory lock ID is a well-known constant (42) — not a security risk since it requires a PostgreSQL connection to the same database.

Deployment Strategy

Rolling deployment for JobService pods — replace one pod at a time while the ALB routes traffic to remaining pods. If the leader pod is being replaced, the advisory lock is released, and another pod acquires it within 5 seconds. Zero-downtime deployment for workers — Kafka consumer rebalance automatically redistributes partitions to remaining workers during deployment. Database migrations run during low-traffic windows with a brief maintenance window for schema changes requiring table locks. Redis cache warms naturally after deployment (60-second TTL means full warm-up within 2 minutes).

Real-World Examples
  • Apache Airflow's CeleryExecutor uses a single scheduler process that polls the metadata database for due DAG runs and publishes tasks to a Celery (Redis/RabbitMQ) queue — the exact pattern of this V1 approach
  • Sidekiq Enterprise uses a leader-elected scheduler (via Redis SETNX) that polls scheduled jobs and enqueues them to Redis queues for worker consumption
  • Quartz Scheduler (Java) uses database-backed clustering with row-level locks for leader election and JDBC job store polling — conceptually identical to advisory lock polling
Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
V0: Naive (Single Scheduler + DB Polling)T11-2s schedule precision, 80ms CRUD~500 jobs/sec dispatch$1,000/monthLow99% (single instance, manual restart)
V1: Leader-Elected Queue (Advisory Lock + Kafka)T2<1s schedule precision, 50ms CRUD10K jobs/sec$3,500/monthMedium99.9% (leader failover <5s)
V2: Sharded Time-Wheel (N Shards + DAGs)T3<100ms schedule precision, 50ms CRUD20K+ jobs/sec$8,000/monthHigh99.95% (shard isolation, DLQ)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why advisory locks instead of ZooKeeper for leader election?

PostgreSQL advisory locks eliminate the need for an additional coordination service. ZooKeeper requires a 3-node cluster (for quorum), monitoring, and operational expertise. Advisory locks are built into PostgreSQL, which is already a critical dependency in this architecture. The lock is acquired with pg_try_advisory_lock(42) and released automatically when the connection terminates. The trade-off is that leader election depends on PostgreSQL availability — if the DB is down, no leader can be elected. Since the scheduler also depends on the DB for job data, this coupling is acceptable. At larger scale with multiple schedulers needing independent coordination, etcd or ZooKeeper may be preferred.

What happens during the leader failover gap?

When the leader dies, it takes 5-10 seconds for the advisory lock to be released (PostgreSQL detects the terminated connection) and another instance to acquire it. During this gap, no new jobs are scheduled — the polling loop is not running. However, already-enqueued jobs in Kafka continue executing normally. Workers are unaffected by the leader transition. After the new leader acquires the lock, it immediately polls for overdue jobs and catches up the backlog. The maximum scheduling delay is the failover gap plus the backlog processing time.

How does the idempotency key prevent double execution?

The execution_id is computed deterministically: hash(job_id + scheduled_time). For a given job at a given scheduled time, there is exactly one valid execution_id. Before running a task, the worker checks JobDB: SELECT 1 FROM executions WHERE execution_id = ?. If the record exists (previous execution started or completed), the worker skips the duplicate. If it does not exist, the worker INSERTs the record (with a unique constraint on execution_id) and proceeds. Even if two workers receive the same Kafka message simultaneously, only one INSERT succeeds — the other gets a unique constraint violation and skips.

Why not use Kafka's delayed message feature?

Kafka does not natively support delayed messages. There is no built-in way to say 'deliver this message at 3:00 PM tomorrow.' Some workarounds exist (delay topics with consumer-side filtering, custom plugins) but they add complexity and have limitations. SQS supports delayed delivery but only up to 15 minutes. DB polling with a time-bucket index provides scheduling precision for any time horizon — from 1 second to 1 year — with a simple, well-understood mechanism. The V2 variant replaces polling with an in-memory time-wheel for even lower overhead.

How does the system handle a Kafka outage?

If Kafka is unavailable, the leader scheduler cannot publish execution events. Due jobs are identified by the polling query but cannot be enqueued. The scheduler should implement a local buffer: hold events in memory and retry Kafka publishes with exponential backoff. Already-enqueued messages in Kafka continue to be consumed by workers (Kafka stores committed data durably). The CRUD API path is unaffected — job creation, status queries, and deletion work normally since they depend on PostgreSQL and Redis, not Kafka.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Job Scheduler?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator