Medium | 5 components | Interview: Very High

Ride Hailing — Naive (Monolith + SQL Distance Sort)

Q: Why is ride-hailing the most common system design interview question?

Ride-hailing combines five hard distributed systems challenges in one problem: (1) real-time geospatial matching — finding the nearest driver using spatial indexing, (2) high-throughput location streaming — 250K GPS updates/sec at Uber scale, (3) ride lifecycle state machine — 8 states with strict transition rules, (4) payment processing — async saga with compensation for failures, and (5) surge pricing — supply/demand computation per geo cell. Uber, Lyft, DiDi, and Grab ask it because it is their core business. Google, Amazon, and Meta ask it because it tests distributed systems fundamentals (geo-sharding, event sourcing, state machines) without domain-specific knowledge.

Q: Why does the O(N) matching scan fail at scale?

ST_Distance is evaluated for every available driver row. On the geometry column used here it returns planar distance in the units of the SRID (degrees for 4326); geodesic distance in meters requires the geography type. Because the ORDER BY sorts on a computed value, PostgreSQL cannot short-circuit: it must evaluate all N rows to guarantee finding the true nearest 5. At 10K drivers, this takes ~120ms. At 50K, ~400ms. At 100K, over 1 second. Meanwhile, each scan competes with concurrent location UPDATEs for CPU, I/O, and buffer access, and those UPDATEs leave dead row versions that the scan must skip until vacuum removes them. KNN ordering through a GiST index (the <-> operator) helps somewhat but degrades under write contention because each location UPDATE modifies the index structure while matching scans are reading it.

Q: At what scale should you migrate from PostgreSQL to Redis GEO?

Migrate when matching latency p99 exceeds 300ms or when location UPDATE contention pushes the database above 70% CPU utilization during peak hours. In this simulation, the inflection point is around 10K-20K drivers per city. Below 10K, PostGIS is simpler and avoids an extra infrastructure dependency. Above 20K, the O(N) scan becomes untenable. Redis GEORADIUS (superseded by GEOSEARCH in Redis 6.2) handles 1M drivers with 2ms queries, returning the same result that took 1+ seconds with ST_Distance.

Q: How does the naive approach handle a driver going offline mid-ride?

Poorly. The driver stops sending GPS updates, but the system does not know the difference between a driver who turned off the app and a driver passing through a cellular dead zone (tunnel, parking garage). The rider continues polling and sees stale driver location. There is no timeout mechanism, no heartbeat, and no fallback matching. The V3 variant uses a ride state machine with heartbeat timeouts — if the driver stops updating for 30 seconds during an active ride, the system transitions to a DRIVER_UNRESPONSIVE state and begins reassignment.

Q: Why not just add a PostGIS spatial index?

PostGIS spatial indexes are usually GiST (Generalized Search Tree), with SP-GiST and BRIN as alternatives. GiST is effective for range queries (ST_DWithin: find all drivers within 5km) but does not accelerate ORDER BY ST_Distance(...), because the planner cannot use the index to order by a function result. Nearest-neighbor (KNN) ordering through a GiST index requires the <-> distance operator in the ORDER BY clause, which returns exact distances from PostgreSQL 9.5 with PostGIS 2.2 onward. Even then, the index degrades under concurrent writes because each location UPDATE inserts into the index tree. At 2,500 UPDATEs/sec, the index is constantly being modified, causing reader stalls. Redis GEO avoids this because it executes commands serially in memory against a geohash-scored sorted set, so readers and writers never contend for locks.

The simplest possible ride-hailing architecture: a single monolith service backed by PostgreSQL with PostGIS. Driver matching uses brute-force SELECT ... ORDER BY ST_Distance, an O(N) full scan on every match request. It demonstrates why geospatial indexing and event-driven architectures become essential as driver counts grow.

Problem Statement

Ride-hailing is one of the most commonly asked system design interview questions because it combines real-time geospatial computation, high-throughput location streaming, transactional ride lifecycle management, payment processing, and surge pricing into a single problem. Companies like Uber, Lyft, DiDi, Grab, Ola, and Bolt all ask variants of this question because it directly maps to their production engineering challenges.

The naive approach uses the simplest possible architecture: a single monolith service backed by PostgreSQL with the PostGIS extension. Driver locations are stored as geometry columns and updated via UPDATE every 4 seconds (10K drivers at peak = 2,500 UPDATEs/sec). When a rider requests a ride, the monolith runs the matching query: SELECT driver_id, ST_Distance(location, ST_MakePoint(?, ?)) AS dist FROM drivers WHERE status = 'available' ORDER BY dist LIMIT 5. This computes the distance between the pickup point and every available driver, sorts the entire result set, and returns the top 5 nearest. It is an O(N) sequential scan — PostgreSQL must evaluate every row because ORDER BY ST_Distance cannot use a standard B-tree index.

At 10K drivers, this scan takes 80-200ms — barely acceptable. At 50K drivers, it exceeds 400ms. At 100K drivers, it exceeds 1 second and the system becomes unusable. Meanwhile, 2,500 location UPDATEs/sec are acquiring row-level locks on the same drivers table, creating write contention that further degrades matching query performance. The GiST index on the geometry column helps with range queries (ST_DWithin) but not with nearest-neighbor sorting (ORDER BY ST_Distance), which requires a KNN-GiST index that degrades under concurrent writes.

The architecture has no real-time tracking — riders poll GET /api/v1/rides/{id} every 3-5 seconds to see the driver's position. This creates additional read load (700 QPS at peak) competing with location UPDATEs and match queries for the same database connections. There is no surge pricing (flat rates regardless of demand), no event stream (the match result, payment, and notification are all handled synchronously), and no redundancy (a database failure means total downtime).

This template exists to make the O(N) matching bottleneck visible and measurable. Run the simulation at increasing driver counts and watch matching latency grow linearly while location UPDATE contention creates a cascading slowdown. The comparison with the Geo-Indexed Match variant (V1) quantifies the dramatic improvement: Redis GEORADIUS drops matching from 200ms to 2ms — a 100x speedup from switching O(N) full-scan to O(log N + M) geo-indexed lookup.
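The difference between the full scan and an indexed lookup can be sketched in a few lines of Python. This is an illustration only, not how PostGIS or Redis GEO work internally: it assumes a fixed latitude/longitude grid (the cell size `CELL_DEG` and every function name here are hypothetical) and uses the haversine formula for distance.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
CELL_DEG = 0.01             # hypothetical cell size (about 1.1 km of latitude)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_brute_force(drivers, lat, lng, k=5):
    """O(N): measure the distance to every driver, then sort all of them."""
    ranked = sorted(drivers, key=lambda d: haversine_m(lat, lng, d[1], d[2]))
    return [d[0] for d in ranked[:k]]

def cell_of(lat, lng):
    """Map a coordinate to its integer grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))

def build_grid(drivers):
    """Bucket drivers by cell; each driver is a (driver_id, lat, lng) tuple."""
    grid = defaultdict(list)
    for d in drivers:
        grid[cell_of(d[1], d[2])].append(d)
    return grid

def nearest_grid(grid, lat, lng, k=5):
    """Inspect only the pickup cell and its 8 neighbours, then sort that subset."""
    cx, cy = cell_of(lat, lng)
    candidates = [d for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  for d in grid.get((cx + dx, cy + dy), [])]
    ranked = sorted(candidates, key=lambda d: haversine_m(lat, lng, d[1], d[2]))
    return [d[0] for d in ranked[:k]]
```

The grid version sorts only the handful of drivers near the pickup point instead of the whole fleet. It is an approximation: with a fixed 3x3 window it can return fewer than k drivers, so a real index would widen the search ring until enough candidates are found.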

Interviewers expect candidates to identify the O(N) scan as the primary bottleneck, propose geospatial indexing (Redis GEO, Geohash, or H3) as the solution, discuss the write contention from concurrent location UPDATEs, and reason about the transition from polling to WebSocket for real-time tracking.

Architecture Overview

The naive ride-hailing system has five core components: Rider Client, Driver Client, Load Balancer, Monolith Service, and PostgreSQL Database, plus an optional Redis Session Cache. There is no geospatial index, no event stream, no WebSocket service, and no separation between location ingestion, ride matching, and status queries.

All traffic arrives at the Load Balancer (AWS ALB), which distributes requests across Monolith pods using round-robin. The Load Balancer adds approximately 1.5ms of routing latency and can handle up to 30K RPS — well above the system's actual limits, which are constrained by the database. The Load Balancer is never the bottleneck; the database is.

The Monolith handles three types of requests. Location updates (90% of traffic): drivers POST their GPS coordinates every 4 seconds, and the Monolith executes UPDATE drivers SET location = ST_MakePoint(lng, lat) WHERE driver_id = ?. Each UPDATE acquires a row-level lock. Ride requests (3% of traffic): riders POST a ride request, and the Monolith performs the O(N) ST_Distance scan to find the nearest available driver, then INSERTs a ride record. Status queries (7% of traffic): riders poll GET /rides/{id} every 3-5 seconds to check ride status and driver location, triggering a JOIN between rides and drivers.

PostgreSQL stores three tables: drivers (with a PostGIS geometry column for location), rides (ride records with status lifecycle), and riders (accounts and payment methods). A single primary instance with no read replicas handles all reads and writes. At peak: 2,500 location UPDATEs/sec + 300 match queries/sec + 700 status SELECTs/sec = 3,500 ops/sec. The database connection pool can sustain approximately 5K ops/sec before saturation.

Redis serves as an optional session cache for authenticated tokens and recent ride status, reducing DB read load for status queries by approximately 40%. It is explicitly not used for geospatial indexing — that is the V1 variant's approach.

Settlement is synchronous: when a ride completes, the Monolith calculates the fare (base + distance x rate + time x rate), updates the ride record with the fare, and triggers payment directly. There is no async pipeline, no compensation on failure, and no saga pattern. If the payment call fails, the rider sees an error and must retry manually.
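The fare formula above is simple enough to write out directly. The sketch below works in integer cents to match the fare_cents column; the function name and the example rates are hypothetical, since the source gives the formula but no actual prices.

```python
def fare_cents(base_cents, distance_km, cents_per_km, duration_min, cents_per_min):
    """Fare = base + distance x rate + time x rate, rounded to whole cents."""
    if distance_km < 0 or duration_min < 0:
        raise ValueError("distance and duration must be non-negative")
    total = base_cents + distance_km * cents_per_km + duration_min * cents_per_min
    return round(total)

# Hypothetical rates: $2.50 base, $1.20 per km, $0.30 per minute
example = fare_cents(250, 8.4, 120, 21, 30)  # 250 + 1008 + 630 = 1888 cents
```

There is no surge multiplier in this formula, which matches the flat-rate pricing of the naive variant.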

Request Flow — Driver Matching via ST_Distance Scan

This sequence diagram traces three primary flows: driver location updates, ride matching, and ride status polling. The critical insight is the O(N) matching scan: PostgreSQL must compute ST_Distance for every available driver to find the nearest 5. At high driver counts, this scan takes 200ms+ while competing with thousands of concurrent location UPDATEs for the same database connections and GiST index structures.

The second insight is the contention between location UPDATEs and matching queries. Both operate on the drivers table: UPDATEs rewrite the location column (taking a row-level lock and creating a new row version), while matching scans read every row's location. Under PostgreSQL's MVCC, plain SELECTs are not blocked by row locks, but the two workloads still compete for CPU, I/O, buffer access, and connections. Each UPDATE also inserts a new entry into the GiST index and leaves a dead row version behind, so scans traverse a constantly changing, bloating structure until vacuum catches up.

Step-by-Step Walkthrough

1. Driver sends GPS coordinates every 4 seconds. The Monolith executes UPDATE on the drivers table, acquiring a row-level lock and triggering GiST index maintenance. At 10K drivers: 2,500 UPDATEs/sec.
2. Rider requests a ride. The Monolith executes the O(N) matching query: SELECT ... ORDER BY ST_Distance ... LIMIT 5. PostgreSQL computes distance for every available driver, sorts the full result set, and returns the top 5. This is a sequential scan: no index can accelerate ORDER BY on an ST_Distance result.
3. The matched driver's row is UPDATEd to status='busy' and a ride record is INSERTed. If another ride request matched the same driver concurrently, one assignment fails and requires a re-match (~200ms delay).
4. The rider polls GET /rides/{id} every 3-5 seconds. The Monolith first checks Redis for cached status. On a cache miss, it JOINs rides with drivers to get the current driver location, adding ~20ms of DB load per miss.
5. All three flows compete for the same PostgreSQL connection pool (200 connections max). At peak, 2,500 UPDATEs + 300 matching scans + 700 status queries = 3,500 ops/sec. The connection pool saturates around 5K ops/sec.
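The pressure on the 200-connection pool in step 5 can be estimated with Little's Law (average concurrency = arrival rate x time in system). The sketch below is a back-of-envelope estimate, not a measurement: it reuses the rates from the walkthrough and the approximate per-operation latencies quoted in this document (~50ms per location UPDATE, ~120ms per match at 10K drivers, ~20ms per status query), and it assumes every status poll reaches the database.

```python
POOL_SIZE = 200  # max connections, from the walkthrough

def connections_in_use(ops_per_sec, latency_ms):
    """Little's Law: average busy connections = rate x time each one is held."""
    return ops_per_sec * latency_ms / 1000

# (ops/sec, approximate latency in ms) per request type
workload = {
    "location_update": (2500, 50),
    "match_scan": (300, 120),
    "status_poll": (700, 20),
}

busy = {name: connections_in_use(rate, ms) for name, (rate, ms) in workload.items()}
total_busy = sum(busy.values())       # 125 + 36 + 14 = 175 connections
utilization = total_busy / POOL_SIZE  # 0.875
```

Under these assumptions the pool is already about 88% busy at 10K drivers. If match latency alone rose to 400ms, matching would hold 120 connections and the total (259) would exceed the pool.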

Pseudocode

// DRIVER LOCATION UPDATE: every 4 seconds per driver
async function updateDriverLocation(driver_id, lat, lng, heading, speed):
    // ST_SetSRID is required: the column is GEOMETRY(Point, 4326) and a bare
    // ST_MakePoint has SRID 0. PostGIS points are (x, y) = (lng, lat).
    await db.execute(
        "UPDATE drivers SET location = ST_SetSRID(ST_MakePoint($1, $2), 4326), heading = $3, speed = $4, updated_at = NOW() WHERE driver_id = $5",
        [lng, lat, heading, speed, driver_id]
    )   // ~50ms: row lock + GiST index maintenance
    // At 10K drivers: 2,500 of these per second
    return 200

// RIDE MATCHING: O(N) sequential scan
async function requestRide(rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng):
    // Step 1: Find nearest available drivers (THE BOTTLENECK)
    // On a 4326 geometry column, dist is planar and in degrees, so it only
    // approximates ground distance.
    drivers = await db.execute(
        "SELECT driver_id, ST_Distance(location, ST_SetSRID(ST_MakePoint($1, $2), 4326)) AS dist " +
        "FROM drivers WHERE status = 'available' ORDER BY dist LIMIT 5",
        [pickup_lng, pickup_lat]
    )   // O(N) scan: 10K drivers = ~120ms, 50K = ~400ms, 100K = ~1s+

    if drivers is empty:
        return { error: "NO_DRIVERS_AVAILABLE" }

    best_driver = drivers[0]  // Nearest by straight-line distance

    // Step 2: Assign driver + create ride (within transaction)
    ride_id = uuid()
    await db.begin()
    // Guard on status so a concurrent match for the same driver updates 0 rows
    updated = await db.execute(
        "UPDATE drivers SET status='busy' WHERE driver_id=$1 AND status='available'",
        [best_driver.driver_id]
    )
    if updated.row_count == 0:
        await db.rollback()
        return requestRide(rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng)  // re-match
    await db.execute(
        "INSERT INTO rides (ride_id, rider_id, driver_id, status, pickup_lat, pickup_lng, dest_lat, dest_lng) " +
        "VALUES ($1, $2, $3, 'MATCHED', $4, $5, $6, $7)",
        [ride_id, rider_id, best_driver.driver_id, pickup_lat, pickup_lng, dest_lat, dest_lng]
    )
    await db.commit()  // ~50ms: transaction + WAL flush
    return { ride_id, driver: best_driver }

Database Schema (ER Diagram)

The schema reflects the naive approach's single-database design. The drivers table carries the geospatial workload: 2,500 UPDATEs/sec for location streaming plus O(N) scans for matching. The rides table records ride lifecycle. Both tables share the same PostgreSQL instance, creating mutual degradation under load.

The critical column is drivers.location — a PostGIS GEOMETRY(Point, 4326) column indexed with a GiST index. This column is the hottest data in the system: written 2,500 times/sec and read by every matching query. The GiST index helps with range queries (ST_DWithin) but not with the ORDER BY ST_Distance pattern used for nearest-neighbor matching.

Step-by-Step Walkthrough

1. The drivers table stores current driver positions as PostGIS geometry points. The GiST index on location supports ST_DWithin range queries but cannot accelerate ORDER BY ST_Distance (nearest-neighbor requires a full scan or KNN-GiST, which degrades under write contention).
2. The rides table records the ride lifecycle: REQUESTED -> MATCHED -> IN_PROGRESS -> COMPLETED. The driver_id is null until matching completes. Fare is calculated on ride completion based on distance and time.
3. The riders table stores account information and payment method references. It is a small table, fully cached in the buffer pool, and not a performance concern.
4. The status column on drivers is the key for matching: only 'available' drivers are considered. The partial index WHERE status = 'available' reduces the scan set but does not eliminate the O(N) distance computation.
5. All three tables share the same connection pool (200 max connections). At peak: 2,500 driver UPDATEs + 300 ride INSERTs + 700 status SELECTs = 3,500 ops/sec competing for 200 connections.
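The ride lifecycle in step 2 is a small state machine, and the application layer can enforce it before writing rides.status. The sketch below is a minimal illustration that covers only the four states named in this document; cancellation or failure states are not described here and are deliberately left out. `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `advance` are hypothetical names.

```python
# Legal next states for each ride status, per the lifecycle
# REQUESTED -> MATCHED -> IN_PROGRESS -> COMPLETED
ALLOWED_TRANSITIONS = {
    "REQUESTED": {"MATCHED"},
    "MATCHED": {"IN_PROGRESS"},
    "IN_PROGRESS": {"COMPLETED"},
    "COMPLETED": set(),  # terminal state
}

class InvalidTransition(Exception):
    """Raised when a status change would skip or reverse a lifecycle step."""

def advance(current, target):
    """Return the new status if current -> target is legal, else raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Calling `advance("REQUESTED", "MATCHED")` returns `"MATCHED"`, while `advance("REQUESTED", "COMPLETED")` raises `InvalidTransition`, so a ride can never be completed without first being matched and started.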

Pseudocode

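-- Requires PostGIS for the GEOMETRY type and ST_* functions.
-- gen_random_uuid() is built in from PostgreSQL 13 (earlier versions need pgcrypto).
CREATE EXTENSION IF NOT EXISTS postgis;

-- RIDERS TABLE: created first because rides references it
CREATE TABLE riders (
    rider_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT,
    payment_method_id TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- DRIVERS TABLE: hottest table, 2,500 UPDATEs/sec
CREATE TABLE drivers (
    driver_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'available',  -- available / busy / offline
    location GEOMETRY(Point, 4326) NOT NULL,   -- PostGIS geometry, (lng, lat)
    heading FLOAT,
    speed FLOAT,
    updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_drivers_location ON drivers USING GIST (location);
CREATE INDEX idx_drivers_available ON drivers (status) WHERE status = 'available';

-- RIDES TABLE: ride lifecycle records
CREATE TABLE rides (
    ride_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id UUID NOT NULL REFERENCES riders(rider_id),
    driver_id UUID REFERENCES drivers(driver_id),  -- null until matched
    status TEXT NOT NULL DEFAULT 'REQUESTED',
    pickup_lat FLOAT NOT NULL,
    pickup_lng FLOAT NOT NULL,
    dest_lat FLOAT NOT NULL,
    dest_lng FLOAT NOT NULL,
    fare_cents INTEGER,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_rides_rider ON rides (rider_id, created_at);
CREATE INDEX idx_rides_status ON rides (status, created_at)
    WHERE status IN ('REQUESTED', 'MATCHED', 'IN_PROGRESS');

-- THE BOTTLENECK QUERY: O(N) distance computation
-- The pickup point needs the column's SRID, or PostGIS rejects the mixed-SRID comparison.
SELECT driver_id, ST_Distance(
    location, ST_SetSRID(ST_MakePoint(-73.9855, 40.7580), 4326)  -- Times Square (lng, lat)
) AS dist
FROM drivers WHERE status = 'available'
ORDER BY dist LIMIT 5;
-- 10K drivers: ~120ms | 50K: ~400ms | 100K: >1s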

Key Design Decisions

Brute-Force ST_Distance Matching

Choice

SELECT ... ORDER BY ST_Distance(location, pickup) LIMIT 5 for driver matching

Rationale

PostGIS ST_Distance returns the distance between two points: planar distance in SRID units (degrees for SRID 4326) on geometry columns, or geodesic distance in meters on geography columns. Combined with ORDER BY ... LIMIT 5, it finds the nearest 5 drivers to the pickup point. The problem is that PostgreSQL must compute ST_Distance for every available driver row, because sorting on the computed value gives it no way to short-circuit the scan. At 10K drivers this takes 80-200ms. At 100K drivers it exceeds 1 second. Redis GEORADIUS achieves the same result in 2ms using a geohash-based sorted set (O(log N + M) vs O(N)).

Location UPDATE on Driver Row

Choice

UPDATE drivers SET location = ST_MakePoint(lng, lat) every 4 seconds per driver

Rationale

Each driver has one row in the drivers table, and location is a column on that row. UPDATE is simpler than maintaining a separate location history table — one row per driver, no cleanup, no deduplication. The cost is write contention: at 2,500 UPDATEs/sec, row-level locks create queuing. Each UPDATE also triggers GiST index maintenance on the geometry column, adding write amplification. An INSERT-based approach (append-only location log) avoids row locks but requires separate queries to find the latest position.

Single PostgreSQL for Everything

Choice

One database for driver locations, ride records, and rider accounts

Rationale

A single database eliminates data synchronization complexity. Location updates, ride records, and rider accounts all live in PostgreSQL with ACID transactions. The cost is resource contention: location UPDATEs (90% of traffic) compete with matching scans (3%) and status queries (7%) for I/O, connections, and buffer pool memory. Adding a read replica would offload status queries but does not help with location UPDATEs (writes) or matching scans (which need the freshest data from the primary).

Poll-Based Ride Tracking

Choice

Riders poll GET /rides/{id} every 3-5 seconds instead of WebSocket

Rationale

WebSocket requires a persistent connection server, connection management, and a pub/sub mechanism to push updates to the right client. Polling is simpler: the rider's app calls GET /rides/{id} every 3-5 seconds and displays the driver's current location from the response. The cost is wasted bandwidth (most polls return unchanged data) and up to 9 seconds of staleness (up to 4 seconds from the driver GPS interval plus up to 5 seconds from the poll interval). At 10K active rides, polling generates approximately 3K QPS of mostly redundant queries.

No Surge Pricing

Choice

Flat rates regardless of supply/demand ratio

Rationale

Surge pricing requires real-time supply/demand metrics per geo cell: count available drivers and pending ride requests within each cell, compute the ratio, and apply a multiplier. The naive architecture has no mechanism for this — it would require additional GROUP BY queries on every ride request, further loading the already-strained database. The V1 variant introduces Kafka-based event streaming that enables real-time supply/demand computation without additional database load.
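The per-cell computation described above can be sketched as follows. This is an illustration under stated assumptions, not the pricing logic of any variant: it buckets coordinates into a hypothetical fixed grid (`CELL_DEG`), and the multiplier formula (demand/supply ratio, floored at 1.0 and capped at `cap`) is invented for the example.

```python
import math
from collections import Counter

CELL_DEG = 0.01  # hypothetical cell size in degrees

def cell_of(lat, lng):
    """Map a coordinate to its integer grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))

def surge_multipliers(available_drivers, pending_requests, cap=3.0):
    """Return {cell: multiplier} for every cell with at least one pending request.

    available_drivers and pending_requests are lists of (lat, lng) tuples.
    """
    supply = Counter(cell_of(lat, lng) for lat, lng in available_drivers)
    demand = Counter(cell_of(lat, lng) for lat, lng in pending_requests)
    result = {}
    for cell, wanted in demand.items():
        # Treat an empty cell as having one driver to avoid division by zero
        ratio = wanted / max(supply.get(cell, 0), 1)
        result[cell] = min(max(1.0, ratio), cap)
    return result
```

In the naive architecture, the two input lists would have to come from extra GROUP BY style queries against the drivers and rides tables on every request, which is exactly the additional database load the rationale warns about.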

Scale & Performance

Target RPS

~10K sustained (ceiling at database)

Latency (p99)

80-200ms match, 15-50ms status, 50-100ms location update

Storage

~100 GB/year at moderate traffic (10K drivers/city)

Availability

~99% (single instance, no redundancy)

Time & Space Complexity

Operation	Time	Space	Notes
Driver matching (ORDER BY ST_Distance)	O(N) — sequential scan over all available drivers	O(N) — full sort of all driver distances	PostgreSQL must compute ST_Distance for every available driver to find the true nearest 5. At 10K drivers: ~120ms. At 100K drivers: >1 second. This is the primary bottleneck.
Location update (UPDATE drivers SET location)	O(log N) for the B-tree and GiST index updates	O(1) for a single row update	Each UPDATE acquires a row-level lock (~40ms hold time under contention) and triggers GiST index maintenance on the geometry column. At 2,500 UPDATEs/sec, lock contention becomes measurable.
Ride status poll (SELECT ... JOIN)	O(1) — indexed PK lookups on rides and drivers tables	O(1) — constant response size	Fast per-query (15-20ms) but generated at 700 QPS — redundant reads that return unchanged data 80% of the time. WebSocket push eliminates this waste.
Fare calculation (on ride completion)	O(1) — simple arithmetic on ride distance and time	O(1)	Not a performance concern. Base fare + (distance x rate) + (time x rate). No surge multiplier in the naive approach.

Database Schema (HLD)

drivers

Active driver records with PostGIS geometry column for current location. Updated every 4 seconds per active driver (2,500 UPDATEs/sec at 10K drivers). The GiST index on the location column supports ST_DWithin range queries but does not accelerate ORDER BY ST_Distance nearest-neighbor queries.

Columns: driver_id UUID PK, name TEXT, status TEXT (available/busy/offline), location GEOMETRY(Point, 4326), heading FLOAT, speed FLOAT, updated_at TIMESTAMPTZ

Indexes: PK on driver_id, GiST index on location (helps ST_DWithin, not ORDER BY ST_Distance), idx_drivers_available ON (status) WHERE status = 'available'

The location column is the hottest column in the entire database — updated 2,500 times/sec and read by every matching query. Each UPDATE acquires a row-level lock and triggers GiST index maintenance. The partial index on status='available' reduces the scan set but does not eliminate the O(N) distance computation.

rides

Ride records tracking the full lifecycle from request to completion. Written once on ride creation, updated on status transitions (matched, in_progress, completed). Indexed on rider_id for ride history queries and on (status, created_at) for active ride lookups.

Columns: ride_id UUID PK, rider_id UUID FK, driver_id UUID FK (null until matched), status TEXT (REQUESTED/MATCHED/IN_PROGRESS/COMPLETED), pickup_lat FLOAT, pickup_lng FLOAT, dest_lat FLOAT, dest_lng FLOAT, fare_cents INTEGER, created_at TIMESTAMPTZ

Indexes: PK on ride_id, idx_rides_rider ON (rider_id, created_at), idx_rides_status ON (status, created_at) WHERE status IN ('REQUESTED','MATCHED','IN_PROGRESS')

Write volume is low (~300 INSERTs/sec + status UPDATEs). The bottleneck is not the rides table but the drivers table. However, status poll queries (700 QPS) JOIN rides with drivers to get current driver location, compounding database load.

riders

Rider account records with payment method references. Low write volume (account creation only). Read on ride request for account validation and payment method lookup.

Columns: rider_id UUID PK, name TEXT, payment_method_id TEXT, created_at TIMESTAMPTZ

Indexes: PK on rider_id

Small table (~100K rows for a single-city deployment). Fully cached in PostgreSQL buffer pool. Not a performance concern.

What-If Scenarios

Driver goes offline mid-ride (app crash, phone dies, tunnel)

Impact

The rider sees stale driver location on their map (last known GPS before disconnect). The ride remains in IN_PROGRESS state indefinitely — no timeout, no heartbeat mechanism. The rider must manually cancel and re-request, losing the fare for the incomplete portion.

Mitigation

Add a heartbeat mechanism: if no GPS update is received from the driver for 30 seconds during an active ride, transition to DRIVER_UNRESPONSIVE. Alert the rider and begin reassignment. The V3 variant implements this as a formal state machine transition with automated escalation.
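A minimal sketch of that heartbeat check, assuming a periodic sweep over in-memory maps of ride status and last GPS timestamp (in a real system these would be database or cache lookups). The 30-second threshold and the DRIVER_UNRESPONSIVE state come from the mitigation above; the function names and data shapes are hypothetical.

```python
HEARTBEAT_TIMEOUT_S = 30  # seconds without a GPS update before escalating

def find_unresponsive(last_seen_by_ride, now_s):
    """Return ride ids whose driver has been silent for longer than the timeout."""
    return sorted(
        ride_id for ride_id, last_seen_s in last_seen_by_ride.items()
        if now_s - last_seen_s > HEARTBEAT_TIMEOUT_S
    )

def sweep(ride_status, last_seen_by_ride, now_s):
    """Move silent IN_PROGRESS rides to DRIVER_UNRESPONSIVE; return the changed ids."""
    changed = []
    for ride_id in find_unresponsive(last_seen_by_ride, now_s):
        if ride_status.get(ride_id) == "IN_PROGRESS":
            ride_status[ride_id] = "DRIVER_UNRESPONSIVE"
            changed.append(ride_id)
    return changed
```

A timeout cannot tell a closed app from a tunnel, so the threshold is a trade-off: shorter values detect real failures sooner but reassign rides whose drivers were only briefly out of coverage.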

Two concurrent ride requests match the same driver

Impact

Without distributed locking, both matching queries find the same nearest driver and attempt to INSERT ride records with that driver_id. One succeeds; the other discovers the driver is already assigned when it tries to UPDATE the driver status to 'busy'. The second rider experiences a ~200ms re-match delay. At high request rates, duplicate match rate reaches ~2%.

Mitigation

Use SELECT ... FOR UPDATE on the driver row during matching to acquire an exclusive lock. This serializes concurrent matches for the same driver but adds lock contention. The V1 variant solves this more elegantly with Redis atomic GETSET on driver availability.
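A lighter-weight alternative to an explicit FOR UPDATE lock is a conditional UPDATE that acts as a compare-and-set: the status change only applies if the driver is still 'available', and the affected-row count tells the caller whether it won. This is a different technique from the one named above, shown here because it is easy to demonstrate. The sketch uses Python's built-in sqlite3 purely so that it runs without a server; the same UPDATE statement is valid in PostgreSQL. `claim_driver` is a hypothetical helper.

```python
import sqlite3

def claim_driver(conn, driver_id):
    """Flip a driver from 'available' to 'busy'. Returns True only for the winner."""
    cur = conn.execute(
        "UPDATE drivers SET status = 'busy' "
        "WHERE driver_id = ? AND status = 'available'",
        (driver_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another request claimed the driver first

# In-memory stand-in for the drivers table (only the columns the claim needs)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drivers (driver_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO drivers VALUES ('d1', 'available')")
conn.commit()

first = claim_driver(conn, "d1")   # wins the driver
second = claim_driver(conn, "d1")  # loses and must re-match
```

The losing request still pays the re-match cost, so this removes double assignment but not the ~200ms delay described in the impact.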

GPS drift in tunnels or urban canyons (inaccurate location data)

Impact

Driver location jumps erratically — coordinates may place the driver in a building or on the wrong street. The matching algorithm assigns a ride to a driver who appears close but is actually far away, resulting in long pickup times and rider frustration.

Mitigation

Implement GPS sanity checks: reject location updates with speed > 200 km/h or distance > 500m from last known position (impossible physical movement). Apply Kalman filtering to smooth GPS coordinates. The V3 variant includes location validation in the ingestion pipeline.
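The two thresholds above translate into a short validation function. The sketch below assumes each location fix is a (lat, lng, unix_timestamp) tuple and uses the haversine formula for distance; `is_plausible` and the constants' names are hypothetical, and Kalman smoothing is out of scope for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000
MAX_SPEED_KMH = 200  # threshold from the mitigation above
MAX_JUMP_M = 500     # threshold from the mitigation above

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def is_plausible(prev, new):
    """Accept a new fix only if the implied jump and speed are physically possible.

    prev and new are (lat, lng, unix_ts) tuples for the same driver.
    """
    elapsed_s = new[2] - prev[2]
    if elapsed_s <= 0:
        return False  # out-of-order or duplicate timestamp
    dist_m = haversine_m(prev[0], prev[1], new[0], new[1])
    speed_kmh = dist_m / elapsed_s * 3.6
    return dist_m <= MAX_JUMP_M and speed_kmh <= MAX_SPEED_KMH
```

At the normal 4-second update interval, 200 km/h corresponds to about 222 meters, so the speed check is the tighter of the two; the 500-meter limit mainly matters after a gap in updates.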

Payment failure after ride completion

Impact

The rider's card is declined or the payment gateway times out. In the synchronous model, the monolith retries once and then fails the payment. The ride is marked as COMPLETED but unpaid. The driver does not receive earnings. There is no retry queue, no dead-letter mechanism, and no automated follow-up — a support agent must manually resolve the payment.

Mitigation

The V3 variant implements a payment saga: on failure, retry with exponential backoff (1s, 4s, 16s). After 3 failures, escalate to a support queue with all context (ride details, payment attempt history, error codes). The saga ensures the driver eventually receives payment even if the initial charge fails.
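The retry schedule can be sketched as below. The source does not say whether the initial charge counts as one of the three failures; this sketch assumes one initial attempt followed by up to three retries after waits of 1s, 4s, and 16s. The `charge` and `sleep` callables are injected so the logic can run without a payment gateway, and all names are hypothetical.

```python
def backoff_delays(retries=3, base_s=1, factor=4):
    """Exponential schedule: 1s, 4s, 16s for the defaults."""
    return [base_s * factor ** i for i in range(retries)]

def charge_with_retry(charge, sleep, delays=None):
    """Try charge(); on failure wait and retry per the schedule.

    charge: callable returning True on success, False on failure.
    sleep:  callable taking a delay in seconds (time.sleep in production).
    Returns "PAID", or "ESCALATED" once every retry has failed.
    """
    if delays is None:
        delays = backoff_delays()
    if charge():
        return "PAID"
    for delay_s in delays:
        sleep(delay_s)
        if charge():
            return "PAID"
    return "ESCALATED"  # hand off to the support queue with full context
```

In the synchronous naive model this loop would block the request thread for up to 21 seconds, which is one reason the retry logic belongs in an asynchronous pipeline.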

Database failure during rush hour (single point of failure)

Impact

Total system outage — no driver location updates, no ride matching, no status queries. Every active ride loses tracking. New ride requests fail. The system has no fallback, no read replica, and no cached data to serve. Revenue loss is proportional to downtime x peak ride volume.

Mitigation

Add RDS Multi-AZ for automated failover (30-60 seconds recovery). Implement connection pooling via PgBouncer to handle connection storms after failover. The V1 variant separates location storage (Redis GEO) from ride records (PostgreSQL), so driver matching continues even during DB failover.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
PostgreSQL (RideDatabase)	Connection pool exhaustion from concurrent UPDATEs + match queries	All requests fail — no location updates, no ride matching, no status queries. Users see 503 errors. Total system outage because all functionality depends on the single database.	Connection pooling via PgBouncer (transaction mode). Increase max_connections from 200 to 500 as a stopgap. Long-term: separate location storage from ride data using Redis GEO (V1 approach).
Monolith Service	Thread starvation from slow matching queries blocking all threads	If matching queries take 500ms+ (high driver count), all 500 threads (5 pods x 100 threads) can be occupied by slow requests. Location updates and status queries queue behind them. Cascading failure as clients retry timed-out requests.	Separate thread pools for location updates vs ride requests (bulkhead pattern). Set query timeouts on matching queries (500ms max). The V1 variant solves this structurally by separating LocationService (simple, fast) from MatchService (complex, slower).
Redis Session Cache	Cache unavailability	Session validation falls back to database, adding ~15ms per request. Ride status queries bypass cache, adding 700 QPS directly to PostgreSQL. Degraded performance but not a total outage — the cache is an optimization, not a critical path component.	Redis Cluster with automatic failover. Set maxmemory-policy to allkeys-lru. Implement graceful degradation to database reads on cache miss.
Load Balancer	All Monolith health checks fail	ALB returns 502 Bad Gateway. All traffic fails. Users cannot request rides or update locations.	Multi-AZ deployment with at least 2 pods per AZ. Configure health check thresholds to tolerate transient failures (3 consecutive failures before marking unhealthy).
GiST Index	Index bloat from constant location UPDATEs (or, rarely, corruption after a storage or hardware fault)	Matching queries slow down as the index grows with dead entries. A corrupted index can return incorrect results or raise internal errors, so rides are matched to incorrect drivers or not matched at all.	REINDEX INDEX CONCURRENTLY (PostgreSQL 12+) on the location index during maintenance windows. Track index size and usage via pg_stat_user_indexes and pg_relation_size. Long-term: move to Redis GEO where read/write paths don't share index structures.

Scaling Strategy

Vertical scaling only for PostgreSQL (upgrade instance size from db.r7g.xlarge to db.r7g.2xlarge to db.r7g.4xlarge). Horizontal scaling for the Monolith via pod count increase (5 -> 10 -> 20 pods). Auto-scaling trigger: CPU utilization > 70% for 3 consecutive minutes. The ceiling is approximately 10K-20K active drivers per city regardless of monolith pod count, because the database O(N) matching scan is the bottleneck. Beyond this ceiling, architectural changes are required: Redis GEO for O(log N) matching (V1) or a full state machine with outbox events (V3).

Monitoring & Alerting

Key metrics to monitor:
1. Matching query latency (p50, p99): the primary performance indicator. Alert at p99 > 300ms, critical at p99 > 500ms.
2. Location UPDATE rate: should be approximately active_drivers / 4 per second. A drop below the expected rate indicates driver app issues or network problems.
3. PostgreSQL active connections: alert at 70% of max_connections (140/200), critical at 85% (170/200).
4. Row lock wait time on the drivers table: an indicator of UPDATE contention. Alert if the mean wait exceeds 20ms.
5. Ride status poll QPS: should be approximately active_rides x 0.25 (one poll per 4 seconds). Higher-than-expected QPS indicates aggressive polling from mobile clients.
6. Duplicate match rate: the percentage of ride requests that require re-matching due to concurrent assignment. Alert if it exceeds 5%.

Dashboard: Grafana with panels for matching latency histogram, location UPDATE throughput, DB connection pool usage, active rides by status, and driver count by status.

SLIs: matching p99 < 500ms, location update p99 < 200ms, status query p99 < 100ms.
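The connection and latency thresholds can be expressed as small classification functions, for example inside an alerting job. This is a sketch only: the thresholds are the ones listed in this section, while the function names and the "ok" / "alert" / "critical" labels are hypothetical.

```python
def connection_alert(active, max_connections=200):
    """Alert at 70% of max_connections, critical at 85%."""
    # Integer arithmetic avoids floating-point edge cases at the boundaries
    if active * 100 >= 85 * max_connections:
        return "critical"
    if active * 100 >= 70 * max_connections:
        return "alert"
    return "ok"

def match_latency_alert(p99_ms):
    """Alert when matching p99 exceeds 300ms, critical above 500ms."""
    if p99_ms > 500:
        return "critical"
    if p99_ms > 300:
        return "alert"
    return "ok"

def expected_update_rate(active_drivers, interval_s=4):
    """Expected location UPDATEs per second, given one update per interval."""
    return active_drivers / interval_s
```

With the default pool size, 140 active connections is the first value that triggers an alert and 170 the first that is critical; `expected_update_rate(10_000)` gives the 2,500 UPDATEs/sec used throughout this document.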

Cost Analysis

At 10K active drivers (single city): PostgreSQL db.r7g.xlarge (~$350/month), Redis cache.t4g.medium (~$50/month), ECS Fargate 5 pods (~$350/month), ALB (~$30/month). Total: ~$780/month. This is the cheapest variant but breaks down beyond 10K-20K drivers per city. Scaling vertically to db.r7g.2xlarge ($700/month) extends the ceiling to ~30K drivers but does not solve the fundamental O(N) scan problem. The V1 Geo-Indexed variant at 100K drivers costs approximately $2,000/month but handles 10x the driver count — the per-driver cost decreases from $0.078/driver to $0.020/driver as you scale beyond the naive approach's ceiling.

Security Considerations

Rider/driver safety: driver and rider identities verified during signup (photo ID, background check — out of scope for infrastructure but critical for trust). Location privacy: driver GPS coordinates stored in PostgreSQL and visible to matched riders during active rides only — access control enforced at the application layer. Payment security: payment method tokens stored (not raw card numbers) via a PCI-compliant payment gateway. JWT tokens for driver/rider authentication validated on every request (~3ms overhead). Rate limiting: per-user request limiting to prevent location spoofing (drivers faking GPS to appear closer to riders). Anti-fraud: basic velocity checks — flag drivers completing rides impossibly fast or riders requesting rides from impossible locations.

Deployment Strategy

Rolling deployment for the Monolith — replace one pod at a time while the ALB routes traffic to remaining pods. Database migrations run during low-traffic windows (typically 2-4 AM local time) with a brief maintenance window for schema changes requiring table locks. Redis cache is warmed after deployment by pre-loading active session tokens. PostGIS extension updates require database restart and a 30-60 second maintenance window. Zero-downtime deployment achievable for service code changes but not for schema migrations that add/modify GiST indexes.

Real-World Examples

• Uber's original monolith (2010-2013) used a Python backend on PostgreSQL with a similar brute-force nearest-driver query before migrating to a custom geospatial index and eventually the H3 hexagonal grid system
• Lyft's early architecture (2012-2014) used a Python monolith with PostgreSQL/PostGIS before splitting into microservices with a dedicated dispatch service
• Regional ride-hailing startups in Southeast Asia (pre-Grab acquisition) commonly launched with a PostGIS-backed monolith to minimize infrastructure complexity during initial city launches

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Monolith + SQL Distance Sort)	T1	80-200ms match, 50-100ms location update	~10K RPS total	$780/month	Low	99% (single DB)
V1: Geo-Indexed Match (Redis GEO + Kafka)	T2	2ms match, 12ms location update	265K RPS peak	$2,500/month	Medium	99.9% (multi-AZ)
V3: Global Resilient (State Machine + Payment Saga)	T4	<3s match, 15ms location update	280K RPS peak	$6,500/month	Very High	99.99% (multi-region)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why is ride-hailing the most common system design interview question?

Ride-hailing combines five hard distributed systems challenges in one problem: (1) real-time geospatial matching — finding the nearest driver using spatial indexing, (2) high-throughput location streaming — 250K GPS updates/sec at Uber scale, (3) ride lifecycle state machine — 8 states with strict transition rules, (4) payment processing — async saga with compensation for failures, and (5) surge pricing — supply/demand computation per geo cell. Uber, Lyft, DiDi, and Grab ask it because it is their core business. Google, Amazon, and Meta ask it because it tests distributed systems fundamentals (geo-sharding, event sourcing, state machines) without domain-specific knowledge.

Why does the O(N) matching scan fail at scale?

ST_Distance computes a geodesic distance for every available driver row (on geography values PostGIS measures along the spheroid by default, which costs more than a spherical great-circle calculation). Without a usable index, PostgreSQL cannot short-circuit: it must evaluate all N rows to guarantee finding the true nearest 5, because ORDER BY needs every distance before it can rank them. At 10K drivers, this takes ~120ms. At 50K, ~400ms. At 100K, over 1 second. Meanwhile, each scan briefly takes shared buffer locks on the pages it reads and competes for CPU and I/O, which slows concurrent location UPDATEs. A KNN-GiST index helps somewhat but degrades under write contention because each location UPDATE modifies the index structure while matching scans are reading it.
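The linear cost is easiest to see outside the database. The sketch below is a plain-Python stand-in for the brute-force query, not the actual SQL: it assumes drivers are (driver_id, lat, lon) tuples and uses a spherical haversine distance instead of PostGIS's calculation. Every driver gets a distance computed before the nearest k can be chosen, so work grows in proportion to N.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_drivers_bruteforce(rider, drivers, k=5):
    """Return the k nearest drivers as (distance_km, driver_id), closest first.

    rider is (lat, lon); drivers is an iterable of (driver_id, lat, lon).
    One distance is computed per driver: O(N) distance calls per match request.
    """
    scored = (
        (haversine_km(rider[0], rider[1], lat, lon), driver_id)
        for driver_id, lat, lon in drivers
    )
    return heapq.nsmallest(k, scored)
```

Usage: `nearest_drivers_bruteforce((37.7749, -122.4194), drivers, k=5)`, where `drivers` is the full list of available drivers for the city.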

At what scale should you migrate from PostgreSQL to Redis GEO?

Migrate when matching latency p99 exceeds 300ms or when location UPDATE contention pushes the database above 70% CPU utilization during peak hours. In this simulation, the inflection point is around 10K-20K drivers per city. Below 10K, PostGIS is simpler and avoids an extra infrastructure dependency for geospatial state. Above 20K, the O(N) scan becomes untenable. Redis GEORADIUS (superseded by GEOSEARCH in Redis 6.2+) handles 1M drivers with 2ms queries, compared with the 1+ seconds ST_Distance needed at 100K drivers.
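The two triggers can be written down as a simple check. This is a sketch under stated assumptions: the 300ms and 70% thresholds come from the answer above, while the function names and the nearest-rank p99 calculation are illustrative choices, not part of the simulator.

```python
def p99_nearest_rank(latencies_ms):
    """Return the 99th percentile of a non-empty sample using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def should_migrate_to_redis_geo(match_latencies_ms, peak_db_cpu_pct,
                                p99_limit_ms=300.0, cpu_limit_pct=70.0):
    """True when either migration trigger from the text is exceeded."""
    return (
        p99_nearest_rank(match_latencies_ms) > p99_limit_ms
        or peak_db_cpu_pct > cpu_limit_pct
    )
```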

How does the naive approach handle a driver going offline mid-ride?

Poorly. The driver stops sending GPS updates, but the system does not know the difference between a driver who turned off the app and a driver passing through a cellular dead zone (tunnel, parking garage). The rider continues polling and sees stale driver location. There is no timeout mechanism, no heartbeat, and no fallback matching. The V3 variant uses a ride state machine with heartbeat timeouts — if the driver stops updating for 30 seconds during an active ride, the system transitions to a DRIVER_UNRESPONSIVE state and begins reassignment.
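The heartbeat timeout described for V3 can be sketched as follows. Only the 30-second threshold and the DRIVER_UNRESPONSIVE state come from the text; the IN_PROGRESS state name, the ActiveRide record, and the function names are hypothetical, and reassignment itself is left out.

```python
from dataclasses import dataclass
from enum import Enum

HEARTBEAT_TIMEOUT_S = 30.0  # from the text: 30 seconds without a driver update


class RideState(Enum):
    IN_PROGRESS = "IN_PROGRESS"                  # hypothetical name for an active ride
    DRIVER_UNRESPONSIVE = "DRIVER_UNRESPONSIVE"  # state named in the text


@dataclass
class ActiveRide:
    ride_id: str
    state: RideState
    last_heartbeat_s: float  # timestamp of the driver's most recent location update


def record_heartbeat(ride: ActiveRide, now_s: float) -> None:
    """Call whenever a location update arrives from the ride's driver."""
    ride.last_heartbeat_s = now_s


def check_heartbeat(ride: ActiveRide, now_s: float,
                    timeout_s: float = HEARTBEAT_TIMEOUT_S) -> RideState:
    """Move an active ride to DRIVER_UNRESPONSIVE once the driver has been silent too long."""
    silent_for_s = now_s - ride.last_heartbeat_s
    if ride.state is RideState.IN_PROGRESS and silent_for_s >= timeout_s:
        ride.state = RideState.DRIVER_UNRESPONSIVE
    return ride.state
```

A periodic job would call `check_heartbeat` for each active ride. A short timeout reassigns quickly but risks false positives in tunnels and parking garages, which is the ambiguity the naive variant cannot resolve at all.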

Why not just add a PostGIS spatial index?

PostGIS supports several spatial index types, including GiST (Generalized Search Tree), SP-GiST, and BRIN. A GiST index is effective for range queries (ST_DWithin: find all drivers within 5km) but is not used by ORDER BY ST_Distance(...), which still computes and sorts every distance. Nearest-neighbor ordering has to be written with the <-> distance operator so the planner can use a KNN-GiST index scan (exact-distance ordering needs PostgreSQL 9.5+ with PostGIS 2.2+). Even then, these indexes degrade under concurrent writes because each location UPDATE modifies the index tree. At 2,500 UPDATEs/sec, the index is constantly being modified, causing reader stalls. Redis GEO avoids this because it keeps positions in an in-memory sorted set scored by geohash and executes commands one at a time, so a location write is a cheap sorted-set update that never runs concurrently with a read.
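To illustrate why a spatial index avoids the full scan, here is a simplified fixed-grid index in Python. It is not how Redis or PostGIS store data (Redis GEO uses geohash scores in a sorted set); the class, the cell size, and the method names are assumptions for illustration. The bounding-box math is an approximation suitable for city-scale radii away from the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
KM_PER_DEG_LAT = math.radians(1.0) * EARTH_RADIUS_KM  # about 111.2 km


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class GridIndex:
    """Buckets drivers into fixed-size lat/lon cells (hypothetical helper)."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg
        self.cells = defaultdict(set)  # (row, col) -> driver ids in that cell
        self.positions = {}            # driver id -> (lat, lon)

    def _cell(self, lat, lon):
        return (math.floor(lat / self.cell_deg), math.floor(lon / self.cell_deg))

    def update(self, driver_id, lat, lon):
        """Move a driver: touches at most two cells, regardless of fleet size."""
        old = self.positions.get(driver_id)
        if old is not None:
            self.cells[self._cell(*old)].discard(driver_id)
        self.positions[driver_id] = (lat, lon)
        self.cells[self._cell(lat, lon)].add(driver_id)

    def within_radius(self, lat, lon, radius_km):
        """Return (distance_km, driver_id) for drivers within radius_km, closest first."""
        dlat = radius_km / KM_PER_DEG_LAT
        widest_lat = min(abs(lat) + dlat, 89.9)  # be conservative about longitude span
        dlon = radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(widest_lat)))
        row_lo, col_lo = self._cell(lat - dlat, lon - dlon)
        row_hi, col_hi = self._cell(lat + dlat, lon + dlon)
        hits = []
        for row in range(row_lo, row_hi + 1):
            for col in range(col_lo, col_hi + 1):
                # Only drivers in nearby cells get an exact distance check.
                for driver_id in self.cells.get((row, col), ()):
                    d = haversine_km(lat, lon, *self.positions[driver_id])
                    if d <= radius_km:
                        hits.append((d, driver_id))
        return sorted(hits)
```

The query cost depends on the number of drivers in the cells around the rider rather than on the total fleet, which is the property the naive ST_Distance sort lacks.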

Related Templates

• Ride Hailing — Geo-Indexed Match (Redis GEO + Kafka)
• Ride Hailing — Global Resilient (State Machine + Payment Saga)
• Ad Click Aggregator — Naive (Single Service + SQL)
