Hard · 9 components · Interview: Very High

Ride Hailing — Geo-Indexed Match (Redis GEO + Kafka)

Industry-standard ride-hailing architecture using Redis GEO for O(log N) nearest-driver lookups, a dedicated location ingestion service for 250K GPS updates/sec, and Kafka for decoupled ride lifecycle events. Sub-5-second matching with real-time WebSocket tracking.

Redis GEO · Kafka · WebSocket · Real-time · Ride Hailing
Problem Statement

The geo-indexed match approach is the industry-standard ride-hailing architecture, used in production by services such as Uber, Lyft, and Grab. It solves the two fundamental problems with the naive architecture: the O(N) matching scan and the tightly coupled monolith.

The key insight is replacing PostgreSQL's unindexed ST_Distance scan with Redis GEORADIUS. Redis GEO stores driver locations in a sorted set using geohash encoding, enabling O(log N + M) radius queries where M is the number of results. At 1M drivers, a GEORADIUS 5km query returns in approximately 2ms, compared to anywhere from hundreds of milliseconds to over 1 second with the naive ST_Distance approach: a 100-500x improvement. Bucketing locations into indexed spatial cells is also the idea behind Uber's H3 hexagonal grid system. (Redis 6.2 and later deprecate GEORADIUS in favor of GEOSEARCH, which supports the same radius query.)

The architecture separates location ingestion from ride matching. LocationService is a simple, high-throughput service (20 pods) that receives 250K GPS updates/sec from drivers and writes to Redis GEO via GEOADD. MatchService is a complex, lower-throughput service (10 pods) that handles ride requests, performs GEORADIUS queries, scores drivers by estimated arrival time, persists matches to PostgreSQL, and publishes ride events to Kafka. This separation allows each service to scale independently based on its unique resource requirements.

Kafka decouples the ride lifecycle from downstream processing. When a ride is matched, a ride_matched event is published to the ride-lifecycle topic. TrackingWorker consumes these events and pushes real-time updates to riders and drivers via WebSocket. PaymentWorker consumes ride_completed events and processes fares asynchronously. This event-driven architecture means adding new consumers (analytics, fraud detection, surge pricing) requires no changes to MatchService.

WebSocket tracking replaces the naive approach's polling model. Instead of riders polling every 3-5 seconds (generating 700+ redundant QPS), a persistent WebSocket connection pushes driver location updates in real time with sub-2-second freshness. This dramatically reduces database read load and provides a vastly better user experience — the rider sees the car moving smoothly on the map rather than jumping every 3-5 seconds.

The primary trade-off is operational complexity: 9 components instead of 5, Kafka cluster management, Redis cluster monitoring, and WebSocket connection management. This complexity is justified by the 100x-or-greater improvement in matching performance and the ability to scale from 10K to 1M active drivers without architectural changes.

Interviewers expect candidates to explain why Redis GEO is superior to PostGIS for real-time driver matching, discuss the separation of location ingestion from ride matching, reason about Kafka's role in decoupling ride events, and analyze the WebSocket model for live tracking.

Architecture Overview

The geo-indexed match architecture organizes its components into four groups: traffic ingestion (DriverClient, RiderClient, ApiGateway, MainLB), application services (LocationService, MatchService, WSTracking), data stores (DriverCache/Redis GEO, RideDB/PostgreSQL, RideEvents/Kafka), and async workers (TrackingWorker, PaymentWorker).

The location ingestion path handles 95% of traffic by volume. DriverClient apps stream GPS coordinates every 4 seconds to the ApiGateway, which authenticates driver JWT tokens (~3ms) and rate-limits at 300K RPS. The MainLB (NLB) distributes traffic to 20 LocationService pods. Each pod parses the GPS payload (~5ms CPU) and writes to DriverCache (Redis GEO) via GEOADD — an O(log N) operation. The entire path from driver app to Redis write completes in approximately 12-15ms. At 1M active drivers, this produces 250K writes/sec. Redis handles this comfortably with a 6-node cluster sharded by city.

The ride matching path handles the high-value 1% of traffic. RiderClient sends a ride request with pickup and destination coordinates. MatchService receives the request via ApiGateway and MainLB, performs GEORADIUS on DriverCache to find drivers within 5km of the pickup point (~2ms), scores candidates by straight-line ETA, selects the best match, persists the ride record to RideDB (PostgreSQL), and publishes a ride_matched event to RideEvents (Kafka). Total end-to-end latency: approximately 50-100ms.

The real-time tracking layer maintains persistent WebSocket connections with both riders and drivers. WSTracking (50 pods x 40K connections/pod = 2M concurrent connections) receives pushed events from TrackingWorker, which consumes ride lifecycle events from Kafka. When a ride status changes (matched, started, completed), TrackingWorker formats the event and pushes it to WSTracking for fan-out to the relevant rider-driver pair. During active rides, the driver's GPS coordinates are also forwarded through the WebSocket for live map tracking with sub-2-second freshness.

PaymentWorker handles async fare processing. When a ride is completed (driver marks arrived), MatchService calculates the fare and publishes a ride_completed event. PaymentWorker consumes this event, charges the rider's payment method, credits the driver's earnings, and updates the ride status to PAID. Payment is fully async — the rider sees 'ride complete' immediately while payment settles in the background.

Horizontal scaling is independent per component. LocationService scales based on GPS throughput (more pods for more drivers). MatchService scales based on ride request volume (more pods for busier cities). Kafka scales via partition count (32 partitions). Redis scales by adding nodes to the cluster.

Request Flow — Geo-Indexed Matching + Async Events

This sequence diagram traces three primary flows: driver GPS ingestion (high throughput, simple), ride matching (complex, multi-step), and real-time tracking (event-driven push). The critical insight is the separation of concerns: LocationService handles the write-heavy GPS stream (250K/sec) independently of MatchService's complex matching logic (1.5K/sec). Kafka decouples ride creation from downstream processing (tracking, payment).

The GEORADIUS query is the key performance differentiator: about 2ms at 1M drivers versus 200ms or more with the naive ST_Distance approach. This 100x-or-greater improvement lets the system handle roughly 100x more drivers without architectural changes.


Step-by-Step Walkthrough

  1. Drivers stream GPS coordinates every 4 seconds. API Gateway authenticates the JWT (~3ms) and routes to LocationService. LocationService writes to Redis GEO via GEOADD, an O(log N) sorted set insertion. Total path: ~12ms
  2. Rider requests a ride. API Gateway routes to MatchService. MatchService performs GEORADIUS on Redis GEO to find the 20 nearest drivers within 5km (~2ms). Results are filtered by availability and scored by estimated arrival time
  3. MatchService selects the best driver, creates a ride record in PostgreSQL (INSERT + driver status UPDATE in a single transaction, ~50ms), and publishes a ride_matched event to Kafka (~5ms). Total matching: ~60ms
  4. TrackingWorker consumes the ride_matched event from Kafka (~3ms consume latency) and pushes notifications to the rider and driver via the WebSocket service. The rider sees 'driver matched' within 2 seconds of matching completion
  5. During the active ride, the driver's GPS updates are forwarded via WebSocket for live map tracking. The rider sees the car moving smoothly with sub-2-second freshness, with no polling required
  6. On ride completion, MatchService publishes ride_completed with fare details. PaymentWorker consumes this event and processes payment asynchronously. The rider sees 'ride complete' immediately; payment settles in the background (~200-500ms)

Pseudocode

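// LOCATION UPDATE: high throughput, simple logic
async function updateDriverLocation(driver_id, city_id, lat, lng, heading, speed) {
    // Redis GEOADD: O(log N), sorted set with geohash score (longitude before latitude)
    await redis.geoadd("driver:" + city_id, lng, lat, driver_id)  // ~2ms
    // Status is deliberately not written here, so a GPS ping cannot flip a busy driver
    // back to available. Assumed: status is set when the driver goes online and by MatchService.
    await redis.hset("driver:" + driver_id + ":meta", {
        heading: heading, speed: speed, ts: Date.now()
    })
    return 200  // At 1M drivers: 250K of these per second across 20 pods
}

// RIDE MATCHING: complex logic, lower throughput
async function requestRide(rider_id, pickup_lat, pickup_lng, dest_lat, dest_lng) {
    const city_id = getCityFromCoords(pickup_lat, pickup_lng)

    // Step 1: GEORADIUS, O(log N + M), ~2ms at 1M drivers
    const candidates = await redis.georadius(
        "driver:" + city_id, pickup_lng, pickup_lat, 5, "km",
        "WITHCOORD", "WITHDIST", "COUNT", 20, "ASC"
    )

    // Step 2: Filter by availability (same key format as the write path), then pick the nearest
    const statuses = await Promise.all(
        candidates.map(d => redis.hget("driver:" + d.id + ":meta", "status"))
    )
    const available = candidates.filter((d, i) => statuses[i] == "available")
    if (available.length == 0) return { error: "NO_DRIVERS_AVAILABLE" }
    const best = available.sort((a, b) => a.distance - b.distance)[0]  // Nearest by straight-line

    // Step 3: Persist match in one transaction, then publish the event
    const ride_id = uuid()
    await db.begin()
    await db.execute("INSERT INTO rides (...) VALUES (...)", [ride_id, rider_id, best.id, ...])
    // Conditional update: zero affected rows means another request already claimed this driver
    const claimed = await db.execute(
        "UPDATE drivers SET status='busy' WHERE driver_id=$1 AND status='available'", [best.id])
    if (claimed.rowCount == 0) { await db.rollback(); return { error: "DRIVER_TAKEN_RETRY" } }
    await db.commit()  // ~50ms
    await redis.hset("driver:" + best.id + ":meta", { status: "busy" })  // Keep the cache filter in sync

    await kafka.publish("ride-lifecycle", ride_id, {
        event_type: "MATCHED", ride_id, driver_id: best.id, rider_id
    })  // ~5ms

    return { ride_id, driver: best }
}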
Data Storage Architecture

The V1 architecture uses three distinct data stores, each optimized for its specific access pattern. Redis GEO handles the real-time geospatial workload (250K writes/sec, sub-2ms reads). PostgreSQL handles transactional ride records requiring strong consistency. Kafka provides durable event streaming for async processing. This separation of concerns is the key architectural improvement over the naive approach's single-database design.

The Redis GEO sorted set uses geohash encoding to turn 2D coordinates into a 1D sorted set score. GEOADD computes the geohash and inserts into the sorted set (O(log N)). GEORADIUS converts the search circle into geohash ranges that cover it and scans those ranges of the sorted set (O(log N + M)). This is fundamentally more efficient than an unindexed PostGIS ST_Distance sort, which must compute the distance for every row.
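To make the "2D coordinates into a 1D score" step concrete, here is a simplified sketch of bit interleaving. It illustrates the idea only and is not Redis's actual implementation; the function names `interleave_geohash` and `common_prefix_bits` are hypothetical, and the latitude bounds are an assumption based on the limits Redis documents for GEOADD.

```python
def interleave_geohash(lat: float, lng: float, bits_per_axis: int = 26) -> int:
    """Quantize lat/lng to bits_per_axis bits each and interleave them into one integer."""
    # Assumed coordinate bounds (Redis restricts latitude to roughly +/-85.05 degrees).
    lat_min, lat_max = -85.05112878, 85.05112878
    lng_min, lng_max = -180.0, 180.0
    if not (lat_min <= lat <= lat_max and lng_min <= lng <= lng_max):
        raise ValueError("coordinates out of range")

    cells = 1 << bits_per_axis
    # Map each coordinate to an integer cell index in [0, cells - 1].
    lat_cell = min(int((lat - lat_min) / (lat_max - lat_min) * cells), cells - 1)
    lng_cell = min(int((lng - lng_min) / (lng_max - lng_min) * cells), cells - 1)

    score = 0
    for i in range(bits_per_axis):
        # Even bit positions carry latitude bits, odd positions carry longitude bits.
        score |= ((lat_cell >> i) & 1) << (2 * i)
        score |= ((lng_cell >> i) & 1) << (2 * i + 1)
    return score


def common_prefix_bits(a: int, b: int, total_bits: int = 52) -> int:
    """Number of leading bits two scores share; a longer prefix means a smaller shared cell."""
    return total_bits - (a ^ b).bit_length()


nyc_a = interleave_geohash(40.758, -73.9855)
nyc_b = interleave_geohash(40.748, -73.985)
london = interleave_geohash(51.5074, -0.1278)
print(common_prefix_bits(nyc_a, nyc_b), common_prefix_bits(nyc_a, london))
```

Two points about a kilometer apart share a much longer prefix than points on different continents, which is why a radius search can be served by a few range scans. Nearby points that straddle a cell boundary can still share a short prefix, which is one reason a radius query has to scan neighboring cells as well as the center cell.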


Step-by-Step Walkthrough

  1. Redis GEO stores driver locations as geohash-encoded sorted set entries. GEOADD is O(log N); GEORADIUS performs geohash range scans for O(log N + M) radius queries. Drivers who stop sending updates are expired after 30 seconds; because Redis TTLs apply to whole keys rather than individual sorted set members, this is typically done with a TTL on the per-driver metadata hash plus a periodic sweep that removes stale members with ZREM
  2. Driver metadata (status, heading, speed) is stored in Redis hashes supplementing the GEO set. MatchService filters GEORADIUS results by checking availability status in the hash
  3. The rides table in PostgreSQL provides strong consistency for transactional integrity. The ride INSERT and driver status UPDATE are wrapped in a transaction; making the UPDATE conditional on the driver still being available is what prevents double-matching
  4. Kafka ride-lifecycle topic stores every ride event with ride_id as partition key. This guarantees per-ride ordering: MATCHED always precedes STARTED which always precedes COMPLETED
  5. Consumer groups on Kafka (TrackingWorker, PaymentWorker) process events independently. Adding new consumers (analytics, fraud detection) requires no changes to the producer (MatchService)

Pseudocode

-- REDIS GEO: Driver locations (250K writes/sec)
-- Internally stored as sorted set with geohash scores
GEOADD driver:nyc -73.985 40.748 "driver_abc123"  -- O(log N)
GEORADIUS driver:nyc -73.985 40.748 5 km WITHCOORD WITHDIST COUNT 20 ASC  -- O(log N + M)

-- REDIS HASH: Driver metadata
HSET driver:driver_abc123:meta status available heading 90 speed 30

-- POSTGRESQL: Ride records (strong consistency)
CREATE TABLE rides (
    ride_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    rider_id UUID NOT NULL,
    driver_id UUID,
    status TEXT NOT NULL DEFAULT 'MATCHING',
    pickup_lat FLOAT NOT NULL,
    pickup_lng FLOAT NOT NULL,
    dest_lat FLOAT NOT NULL,
    dest_lng FLOAT NOT NULL,
    fare_cents INTEGER,
    surge_multiplier FLOAT DEFAULT 1.0,
    created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_rides_rider ON rides (rider_id, created_at);
CREATE INDEX idx_rides_status ON rides (status) WHERE status IN ('MATCHING', 'MATCHED', 'IN_PROGRESS');

-- KAFKA: Ride lifecycle events (32 partitions, key=ride_id)
-- Topic: ride-lifecycle
-- Retention: 7 days
-- Consumer groups: tracking-workers, payment-workers
Key Design Decisions
Redis GEO for Driver Location Index

Choice

GEOADD/GEORADIUS instead of PostgreSQL PostGIS

Rationale

Redis GEO operates in O(log N + M) where M is the result count. For 1M drivers, a GEORADIUS 5km query returns in approximately 2ms. PostgreSQL ST_Distance with ORDER BY and no spatial index requires O(N): it computes the distance for every driver. The 100-500x speedup comes from geohash-based indexing: drivers are stored in a sorted set with geohash-encoded coordinates, enabling range scans instead of full table scans. This is conceptually similar to Uber's H3 hexagonal index, which also buckets locations into indexed cells.

Separate LocationService and MatchService

Choice

Independent microservices for GPS ingestion and ride matching

Rationale

LocationService handles 250K writes/sec (high throughput, simple logic — parse GPS, GEOADD to Redis). MatchService handles 1.5K rides/sec (lower throughput, complex logic — GEORADIUS + scoring + DB write + Kafka publish). Separating them allows independent scaling: 20 location pods vs 10 match pods. A combined service would need 30 pods all sized for the heavier match logic, wasting resources. Location updates should never be slowed by ride matching operations.

Kafka for Ride Lifecycle Events

Choice

Kafka topic with ride_id partition key for ordered event processing

Rationale

The ride lifecycle has multiple consumers: TrackingWorker (live WebSocket updates), PaymentWorker (fare processing), and potentially analytics, fraud detection, and surge pricing. Kafka's pub-sub model decouples producers from consumers. If payment processing is slow, it does not block the rider's 'ride complete' screen. Partitioning by ride_id guarantees event ordering within a single ride (matched before started before completed).
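The ordering guarantee follows from key-based partitioning: every event with the same ride_id lands in the same partition log, in publish order. The sketch below illustrates this with a stand-in hash. Kafka's default partitioner uses murmur2, so `zlib.crc32` here is only an assumption to keep the example dependency-free, and `partition_for` and `route` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition_for(ride_id: str, num_partitions: int = 32) -> int:
    # Stand-in hash (assumption): real Kafka clients hash the key bytes with murmur2.
    return zlib.crc32(ride_id.encode("utf-8")) % num_partitions


def route(events, num_partitions: int = 32):
    """Append each event to the log of the partition chosen by its ride_id."""
    logs = defaultdict(list)
    for event in events:
        logs[partition_for(event["ride_id"], num_partitions)].append(event)
    return logs


events = [
    {"ride_id": "ride-A", "event_type": "MATCHED"},
    {"ride_id": "ride-B", "event_type": "MATCHED"},
    {"ride_id": "ride-A", "event_type": "STARTED"},
    {"ride_id": "ride-B", "event_type": "STARTED"},
    {"ride_id": "ride-A", "event_type": "COMPLETED"},
]
logs = route(events)
ride_a_log = [e["event_type"] for e in logs[partition_for("ride-A")] if e["ride_id"] == "ride-A"]
print(ride_a_log)
```

Because the partition is a function of both the key and the partition count, changing the count can move a ride_id to a different partition, so ordering is only guaranteed for events published under the same count.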

WebSocket for Live Ride Tracking

Choice

Persistent WebSocket connections instead of client polling

Rationale

During an active ride, the rider needs to see the driver's location update every 2-4 seconds (the 'moving car on the map' experience). Polling via REST would mean 500K GET requests/sec for 1M active rides — wasteful and laggy. WebSocket pushes updates only when the driver moves, using approximately 10x less bandwidth. Mobile battery life depends on this — polling forces the radio to wake every few seconds, while WebSocket maintains an idle connection that consumes minimal power.

Async Payment via Kafka

Choice

Payment processed by a worker consuming ride_completed events

Rationale

Payment gateway calls take 200-500ms and fail 2-5% of the time (card declined, network timeout, insufficient funds). If payment blocks the ride-end flow, the rider stares at a spinner while the driver waits. Async payment via Kafka ensures the ride ends immediately, and payment settles in the background. Failed payments are retried by the PaymentWorker with exponential backoff.
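A minimal sketch of the retry loop, assuming a "full jitter" backoff policy. The source only says "exponential backoff", so the jitter, the base delay, the cap, and the names `PaymentError`, `backoff_delay`, and `charge_with_retry` are all illustrative assumptions.

```python
import random
import time


class PaymentError(Exception):
    """Hypothetical transient gateway failure (timeout, 5xx)."""


def backoff_delay(attempt: int, base_seconds: float = 0.5, cap_seconds: float = 30.0, rng=random) -> float:
    # Full jitter: pick a random delay up to the exponentially growing ceiling.
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def charge_with_retry(charge, max_attempts: int = 5, sleep=time.sleep, rng=random):
    """Call charge() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return charge()
        except PaymentError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

Only transient failures are worth retrying this way; a declined card or insufficient funds would normally be surfaced to the rider rather than retried blindly.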

Straight-Line ETA for Driver Scoring

Choice

Score matched drivers by Haversine distance, not road-network ETA

Rationale

Road-network ETA requires integration with a routing API (Google Maps, OSRM), adding 50-100ms per candidate driver. With 5 candidates, this adds 250-500ms to the matching path. Straight-line distance is computed in microseconds and correlates well with road-network distance in dense urban areas (correlation > 0.85). Production systems like Uber use road-network ETA but batch the API calls to amortize latency.
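For reference, straight-line scoring can be sketched with the standard Haversine formula. The mean Earth radius of 6371 km and the helper names `haversine_km` and `rank_by_distance` are assumptions for illustration, as is the shape of the driver records.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius (assumed constant)


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(pickup, drivers):
    """Sort driver dicts (with 'lat' and 'lng') by straight-line distance to the pickup point."""
    lat, lng = pickup
    return sorted(drivers, key=lambda d: haversine_km(lat, lng, d["lat"], d["lng"]))


drivers = [
    {"id": "far", "lat": 40.780, "lng": -73.985},
    {"id": "near", "lat": 40.750, "lng": -73.985},
]
ranked = rank_by_distance((40.748, -73.985), drivers)
print([d["id"] for d in ranked])
```

When GEORADIUS is called with WITHDIST, Redis already returns a distance per candidate, so a helper like this is mainly useful for re-scoring or for checks outside the query path.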

Scale & Performance

Target RPS

265K peak (250K location + 1.5K rides + 13K status/tracking)

Latency (p99)

~2ms match (GEORADIUS), ~12ms location write, ~50ms ride creation

Storage

~500 GB/year (rides, events, location history)

Availability

99.9% (multi-AZ, no multi-region)

Time & Space Complexity
| Operation | Time | Space | Notes |
|---|---|---|---|
| Driver matching (GEORADIUS) | O(log N + M): geohash range scan + M results | O(M): M matched drivers returned | Redis GEORADIUS scans the geohash-encoded sorted set. At 1M drivers, a 5km radius query returns in ~2ms with ~20 candidates. Compare with O(N) ST_Distance in the naive variant: roughly 100-500x faster, depending on driver count. |
| Location update (GEOADD) | O(log N): sorted set insertion with geohash score | O(1): single member update | Redis GEOADD is an O(log N) sorted set operation. At 250K/sec on 6 shards, each shard handles ~42K ops/sec, well within Redis throughput limits (100K+ ops/sec per node). |
| Ride event publishing (Kafka produce) | O(1): append to partition log | O(1): single message (~1KB) | Kafka append is O(1) amortized. Partition selection by ride_id hash is O(1). Total publish latency: ~5ms including network. |
| WebSocket fan-out (TrackingWorker -> WSTracking) | O(1): push to rider + driver pair | O(C): C concurrent WebSocket connections | Each ride has exactly 2 WebSocket recipients (rider + driver). Fan-out is O(1) per ride event. The 2M concurrent connections consume approximately 8 GB memory across 50 WSTracking pods. |
Database Schema (HLD)
drivers (Redis GEO + Hash)

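Driver locations are stored in a Redis GEO sorted set for sub-2ms radius queries. Supplementary driver metadata is stored in Redis hashes. Drivers who stop sending GPS updates are expired after 30 seconds. Redis TTLs apply to whole keys, not to individual sorted set members, so this is typically implemented as a TTL on the per-driver metadata hash plus a periodic sweep that removes stale members from the GEO set.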

GEO key: driver:{city_id} (geospatial sorted set)
GEO member: {driver_id}
GEO coordinates: (longitude, latitude)
HASH key: driver:{driver_id}:meta
HASH fields: status, heading, speed, vehicle_type, last_update

Indexes: Geohash-encoded sorted set (O(log N) for GEOADD, GEORADIUS)

Redis GEO uses a 52-bit geohash to encode coordinates into sorted set scores. GEORADIUS performs a prefix-based range scan on this sorted set, achieving O(log N + M) query time. At 1M drivers with 6 shards, each shard holds approximately 170K entries — well within Redis memory limits.

rides (PostgreSQL)

Ride records stored in PostgreSQL with strong consistency for transactional integrity. Partitioned by city_id across 32 shards. Write path: ride creation (1.5K/sec) + fare update (1.5K/sec) = 3K writes/sec. Read path: ride status queries (10K reads/sec).

ride_id UUID PK
rider_id UUID FK
driver_id UUID FK
status TEXT (MATCHING/MATCHED/IN_PROGRESS/COMPLETED/PAID)
pickup_lat FLOAT
pickup_lng FLOAT
dest_lat FLOAT
dest_lng FLOAT
fare_cents INTEGER
surge_multiplier FLOAT
created_at TIMESTAMPTZ

Indexes: PK on ride_id, idx_rides_rider ON (rider_id, created_at), idx_rides_status ON (status) WHERE status IN ('MATCHING', 'MATCHED', 'IN_PROGRESS')

Strong consistency prevents double-matching: the ride INSERT and driver status UPDATE are wrapped in a transaction, and the UPDATE only succeeds if the driver is still available. If two concurrent ride requests match the same driver, the second transaction finds no available row to update, rolls back, and triggers re-matching.
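One way to get that behavior is a conditional UPDATE whose affected-row count decides the outcome. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server; the `drivers` table layout and the `claim_driver` helper are assumptions for illustration.

```python
import sqlite3


def claim_driver(conn, ride_id: str, rider_id: str, driver_id: str) -> bool:
    """Return True if this request won the driver, False if another request already did."""
    with conn:  # Commits on normal exit, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE drivers SET status = 'busy' WHERE driver_id = ? AND status = 'available'",
            (driver_id,),
        )
        if cur.rowcount == 0:
            return False  # Driver was already claimed: caller should re-match
        conn.execute(
            "INSERT INTO rides (ride_id, rider_id, driver_id, status) VALUES (?, ?, ?, 'MATCHED')",
            (ride_id, rider_id, driver_id),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drivers (driver_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("CREATE TABLE rides (ride_id TEXT PRIMARY KEY, rider_id TEXT, driver_id TEXT, status TEXT)")
conn.execute("INSERT INTO drivers VALUES ('driver_abc123', 'available')")
conn.commit()

first = claim_driver(conn, "ride-1", "rider-1", "driver_abc123")
second = claim_driver(conn, "ride-2", "rider-2", "driver_abc123")
print(first, second)
```

In PostgreSQL the same pattern works because the second transaction blocks on the row lock until the first commits, then re-evaluates the WHERE clause and updates zero rows.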

ride_events (Kafka topic)

Kafka topic carrying ride lifecycle events: REQUESTED, MATCHED, STARTED, COMPLETED. Partitioned by ride_id (32 partitions) to guarantee ordering within a single ride. Retained for 7 days for replay capability.

ride_id TEXT (partition key)
event_type TEXT (REQUESTED/MATCHED/STARTED/COMPLETED)
driver_id TEXT
rider_id TEXT
fare_cents INTEGER (on COMPLETED)
timestamp BIGINT

Indexes: Partitioned by ride_id (32 partitions)

Two consumer groups: TrackingWorker (real-time WebSocket push) and PaymentWorker (async fare processing). Consumer lag is the key operational metric — lag > 5 seconds means riders receive delayed tracking updates.

Event Contracts
ride_lifecycle (topic: ride-lifecycle)

Ride lifecycle events published by MatchService on every ride state change. Consumed by TrackingWorker (WebSocket push) and PaymentWorker (fare processing). Partitioned by ride_id for per-ride ordering.

Key Schema

ride_id (string)

Value Schema

{ ride_id: string, event_type: REQUESTED|MATCHED|STARTED|COMPLETED, driver_id?: string, rider_id: string, fare_cents?: number, surge_multiplier?: number, timestamp: number }

What-If Scenarios

Redis GEO cluster node failure (1 of 6 nodes goes down)

Impact

Drivers in the affected city shard are not found by GEORADIUS queries — rides in that city cannot be matched. Location updates for those drivers fail with connection errors. Other cities are unaffected (sharded by city).

Mitigation

Redis Cluster with automatic failover promotes a replica to primary within 15-30 seconds. During failover, MatchService retries GEORADIUS with exponential backoff. Location updates are buffered in LocationService's local queue (2000 message limit) and replayed on recovery.

Kafka consumer lag spikes during peak hour (TrackingWorker falls behind)

Impact

Riders receive delayed tracking updates — the car on the map jumps instead of moving smoothly. Payment processing delays mean drivers see 'payment pending' for minutes instead of seconds. User experience degrades but rides continue to be matched (matching does not depend on Kafka).

Mitigation

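Auto-scale TrackingWorker pods based on the consumer lag metric. Increase the Kafka partition count from 32 to 64 to enable more parallel consumers; because this changes which partition a given ride_id hashes to, per-ride ordering is only guaranteed for rides whose events are all published after the change, so do it in a low-traffic window. Set the consumer lag alert threshold at 5 seconds for TrackingWorker and 30 seconds for PaymentWorker.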

GPS drift in tunnels or urban canyons (inaccurate driver locations)

Impact

GEORADIUS returns drivers who appear close but are actually far away (GPS reports them on the wrong street or inside a building). Riders experience long wait times because the matched driver is farther than expected.

Mitigation

Implement GPS sanity checks in LocationService: reject updates where speed > 200 km/h or distance from last known position > 500m (impossible movement in 4 seconds). Apply Kalman filtering to smooth GPS coordinates. Flag drivers with consistently poor GPS accuracy.
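The two rejection rules can be sketched as a single predicate. This is a minimal illustration under some assumptions: speed is inferred from the distance and timestamps of consecutive updates rather than taken from the device, timestamps are in seconds, and `is_plausible_update` is a hypothetical helper name.

```python
import math

MAX_SPEED_KMH = 200.0  # From the text: reject updates implying more than 200 km/h
MAX_JUMP_M = 500.0     # From the text: reject jumps of more than 500 m between updates


def distance_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Haversine distance in meters (mean Earth radius of 6371 km assumed)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def is_plausible_update(prev: dict, new: dict) -> bool:
    """prev and new are dicts with 'lat', 'lng', and 'ts' (seconds)."""
    jump = distance_m(prev["lat"], prev["lng"], new["lat"], new["lng"])
    if jump > MAX_JUMP_M:
        return False
    elapsed = new["ts"] - prev["ts"]
    if elapsed <= 0:
        return False  # Out-of-order or duplicate timestamp
    speed_kmh = jump / elapsed * 3.6
    return speed_kmh <= MAX_SPEED_KMH
```

With a 4-second update interval the speed rule is the stricter of the two (200 km/h is about 222 m in 4 seconds), so the 500 m rule mainly matters when updates arrive late and the interval is longer.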

Surge pricing fairness concern (2x multiplier during natural disaster)

Impact

Riders in emergency situations face 2-3x fares. Public backlash and potential regulatory action (several jurisdictions have price gouging laws). Media coverage damages brand reputation.

Mitigation

Implement surge price caps for declared emergencies. Geofence emergency zones and disable surge pricing within them. This is a policy decision implemented as a MatchService configuration: max_surge_multiplier_emergency = 1.0 (no surge inside the geofence).

Driver cancellation after matching (driver declines the ride)

Impact

In the V1 architecture, there is no driver acceptance flow — rides are auto-dispatched. If a driver ignores the ride assignment, the rider waits indefinitely. There is no timeout, no cascade to the next driver, and no penalty for the driver.

Mitigation

Implement a 30-second acceptance timeout: if the driver does not confirm pickup within 30 seconds, automatically reassign to the next nearest driver. This requires the ride state machine from V3 — a formal OFFERED -> ACCEPTED/TIMEOUT state transition that the V1 variant lacks.

Failure Modes & Resilience
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| Redis GEO (DriverCache) | Memory exhaustion from driver count exceeding shard capacity | GEOADD operations fail with OOM errors. New driver locations are not indexed, causing GEORADIUS to return stale or incomplete results. Ride matching degrades: matched drivers may be farther away than optimal. | Monitor Redis memory utilization. Alert at 70% of maxmemory. Scale by adding shards (split city into sub-regions). Each driver entry consumes approximately 100 bytes, so 1M drivers requires only ~100MB; OOM usually indicates a memory leak or misconfiguration rather than capacity limits. |
| LocationService | Pod crash under 250K RPS GPS update load | GPS updates for affected drivers are dropped. Driver positions in Redis GEO become stale (up to TTL of 30 seconds). If all LocationService pods crash, all driver locations expire within 30 seconds and GEORADIUS returns no results. | Auto-scale LocationService pods based on CPU and request queue depth. Minimum 5 pods for redundancy. Health check endpoint that validates Redis connectivity. Circuit breaker on Redis writes to prevent cascading failure if Redis is slow. |
| MatchService | GEORADIUS returns zero results (no available drivers nearby) | Ride request fails with 'no drivers available' error. Rider must retry manually. During off-peak hours in suburban areas, this can affect 10-20% of ride requests. | Implement progressive radius expansion: if GEORADIUS 5km returns zero results, retry at 10km, then 20km. Set maximum search radius at 30km. If still no results, return estimated wait time based on nearest driver distance and suggest the rider try again in 5 minutes. |
| Kafka (RideEvents) | Broker failure causing partition unavailability | Events for rides hashed to the failed partition's range are not published. TrackingWorker cannot push updates for those rides. PaymentWorker cannot process their fares. The rides themselves are still created in PostgreSQL. | MSK with 3-way replication (replication factor = 3). In-sync replicas = 2 ensures no data loss on single broker failure. MatchService implements local event buffering with retry on Kafka unavailability. |
| WebSocket Service (WSTracking) | Connection storm after deployment (50 pods restart, 2M reconnections) | All 2M WebSocket connections drop simultaneously. Clients reconnect immediately, creating a thundering herd. WSTracking pods may OOM from connection setup overhead before reaching steady state. | Rolling deployment (restart 2-3 pods at a time, not all 50). Client-side exponential backoff on reconnection (jitter 0-5 seconds). Connection rate limiting on WSTracking (max 1000 new connections/sec per pod). |
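The progressive radius expansion described for MatchService can be sketched as a loop over increasing radii. The `search` callable is a hypothetical stand-in for the GEORADIUS-plus-availability-filter step, and `find_candidates` is an illustrative name; the radii come from the mitigation text.

```python
from typing import Callable, List, Optional, Sequence, Tuple

SEARCH_RADII_KM = (5, 10, 20, 30)  # 5km first, then 10km and 20km, capped at 30km


def find_candidates(
    search: Callable[[float], List[str]],
    radii_km: Sequence[float] = SEARCH_RADII_KM,
) -> Tuple[List[str], Optional[float]]:
    """Try each radius in order and return the first non-empty result with the radius used."""
    for radius in radii_km:
        drivers = search(radius)
        if drivers:
            return drivers, radius
    return [], None  # Nothing within the maximum radius: caller reports 'no drivers available'


# Fake index for demonstration: driver id -> distance from pickup in km
distances = {"d1": 12.0, "d2": 25.0}
result = find_candidates(lambda r: [d for d, km in distances.items() if km <= r])
print(result)
```

Each widening step costs one extra query, so the common case (drivers within 5km) stays a single lookup.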
Scaling Strategy

LocationService scales horizontally based on GPS update throughput: 20 pods for 250K/sec, add 10 pods per additional 125K/sec. MatchService scales based on ride request volume: 10 pods for 1.5K/sec, add 5 pods per additional 750/sec. Redis GEO scales by adding cluster shards: 6 shards for 1M drivers, 12 shards for 2M drivers. Kafka scales by adding partitions (32 -> 64 -> 128) and broker nodes. WSTracking scales based on connection count: 50 pods for 2M connections, add 10 pods per additional 400K connections. Auto-scaling triggers: CPU > 70% for 3 minutes (services), memory > 70% (Redis), consumer lag > 10s (Kafka workers). The architecture scales linearly to ~5M active drivers without architectural changes.
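The pod counts above follow from per-pod capacities implied by the baseline figures (20 pods for 250K updates/sec, 10 pods for 1.5K rides/sec, 40K connections per WSTracking pod). A small worked sketch, with `pods_needed` and `location_updates_per_sec` as hypothetical helper names:

```python
import math

# Per-pod capacity derived from the baseline numbers in the text
PER_POD_CAPACITY = {
    "LocationService": 250_000 / 20,  # 12,500 GPS updates/sec per pod
    "MatchService": 1_500 / 10,       # 150 ride requests/sec per pod
    "WSTracking": 40_000,             # concurrent connections per pod
}


def location_updates_per_sec(active_drivers: int, interval_seconds: int = 4) -> float:
    """Each driver sends one GPS update every interval_seconds."""
    return active_drivers / interval_seconds


def pods_needed(service: str, load: float) -> int:
    """Round up so that capacity is never below the offered load."""
    return math.ceil(load / PER_POD_CAPACITY[service])


print(pods_needed("LocationService", location_updates_per_sec(1_000_000)))
print(pods_needed("LocationService", 375_000))
print(pods_needed("WSTracking", 2_400_000))
```

These are steady-state figures; a real deployment would add headroom on top so that the 70% CPU auto-scaling trigger is not hit at normal load.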

Monitoring & Alerting

Key metrics to monitor:

  1. GEORADIUS latency (p50, p99): should be <5ms; alert at >10ms, which indicates Redis overload or network issues
  2. Location update throughput: should match expected_drivers / 4; a drop indicates driver app issues or LocationService failures
  3. Kafka consumer lag for TrackingWorker: alert at >5s (delayed tracking), critical at >30s
  4. Kafka consumer lag for PaymentWorker: alert at >30s (delayed payment), critical at >5min
  5. WebSocket connection count: should be approximately 2 x active_rides; sudden drops indicate WSTracking failures
  6. Match success rate: percentage of ride requests successfully matched within 30 seconds; alert if <90%
  7. Redis memory utilization per shard: alert at 70%, critical at 85%

Dashboard: Grafana with panels for GEORADIUS latency histogram, location update throughput by city, Kafka consumer lag per group, WebSocket connection count, match latency distribution, and ride status breakdown.

SLIs: match latency p99 < 5s, location update p99 < 30ms, tracking freshness < 2s, payment completion < 5s.
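The consumer lag thresholds can be expressed as a small lookup. This is an illustrative sketch rather than a real alerting configuration; the group names follow the consumer groups listed for the ride-lifecycle topic, and `classify_lag` is a hypothetical helper.

```python
# (alert_seconds, critical_seconds) per consumer group, taken from the thresholds in the text
LAG_THRESHOLDS_SECONDS = {
    "tracking-workers": (5, 30),
    "payment-workers": (30, 300),
}


def classify_lag(group: str, lag_seconds: float) -> str:
    """Map a consumer group's lag to 'ok', 'alert', or 'critical'."""
    alert, critical = LAG_THRESHOLDS_SECONDS[group]
    if lag_seconds > critical:
        return "critical"
    if lag_seconds > alert:
        return "alert"
    return "ok"


print(classify_lag("tracking-workers", 12))
print(classify_lag("payment-workers", 12))
```

The same 12 seconds of lag is an alert for tracking but acceptable for payments, which reflects how much less latency-sensitive the payment path is.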

Cost Analysis

At 100K active drivers (multi-city): Redis Cluster 6 nodes cache.r7g.xlarge (~$900/month), PostgreSQL db.r7g.xlarge (~$350/month), MSK Kafka kafka.m7g.large (~$400/month), ECS Fargate 20 LocationService + 10 MatchService pods (~$450/month), WebSocket API Gateway (~$150/month), ECS Workers 30 pods (~$250/month). Total: ~$2,500/month. At 1M active drivers, scale Redis to 12 nodes ($1,800), Kafka to larger brokers ($800), and pods to 2x ($1,400). Total: ~$5,000/month. Per-driver cost drops from $0.025/driver at 100K to $0.005/driver at 1M — economies of scale from Redis GEO's O(log N) matching.

Security Considerations

Rider/driver safety: GPS coordinates are sensitive PII, stored in Redis with a short expiry and in PostgreSQL with access control. Location data access is restricted to the matched rider/driver pair during active rides only. Payment security: payment method tokens are stored via a PCI-compliant gateway (Stripe, Braintree), with no raw card numbers in the system. JWT authentication: API Gateway validates tokens on every request (~3ms); tokens expire after 24 hours for riders, 8 hours for drivers. Rate limiting: per-driver location updates are limited to 1 request per 3 seconds, which bounds update flooding from a single client. Anti-fraud: velocity checks flag impossibly fast rides or location jumps, which is the main defense against GPS spoofing. WebSocket authentication: connection upgrade requires a valid JWT; connections are terminated on token expiry.

Deployment Strategy

Blue-green deployment for stateless services (LocationService, MatchService). Rolling deployment for WebSocket service (2-3 pods at a time to avoid mass reconnection). Kafka topic changes (partition count increase) performed during low-traffic windows. Redis Cluster scaling (adding shards) performed live with automatic key migration. PostgreSQL schema migrations with zero-downtime DDL (CREATE INDEX CONCURRENTLY, no table locks). Canary deployment for MatchService changes: route 5% of traffic to new version, monitor match success rate and latency, promote to 100% after 30 minutes of stable metrics.

Real-World Examples
  • Uber's early microservices architecture (2014-2017) used Redis for geospatial indexing and Kafka for ride events before migrating to the custom H3 hexagonal grid and Apache Flink for stream processing
  • Lyft's dispatch service uses a combination of Redis GEO and machine learning models for driver-rider matching, with Kafka for event streaming and AWS for infrastructure
  • Grab's Southeast Asian ride-hailing platform uses a similar architecture with Redis GEO for driver indexing, partitioned by city, handling 1M+ active drivers across 8 countries
  • DiDi's core matching system in China processes 30M+ rides daily using a geo-indexed architecture with custom sharding by administrative district within each city
Solution Comparison
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Monolith + SQL Distance Sort) | T1 | 80-200ms match, 50-100ms location update | ~10K RPS total | $780/month | Low | 99% (single DB) |
| V1: Geo-Indexed Match (Redis GEO + Kafka) | T2 | 2ms match, 12ms location update | 265K RPS peak | $2,500/month | Medium | 99.9% (multi-AZ) |
| V3: Global Resilient (State Machine + Payment Saga) | T4 | <3s match, 15ms location update | 280K RPS peak | $6,500/month | Very High | 99.99% (multi-region) |

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why Redis GEO instead of Elasticsearch or a custom geohash index?

Redis GEO provides the simplest operational model for real-time geospatial queries. GEOADD and GEORADIUS are native commands — no plugins, no mapping configuration, no cluster coordination. Elasticsearch GEO supports richer queries (polygon intersections, geoshape indexing) but adds significant operational complexity and 10-50ms query latency. A custom geohash index in application code requires implementing sorted set operations, TTL management, and sharding — all of which Redis provides natively. At the scale of ride-hailing (1M drivers, 250K updates/sec), Redis GEO handles the workload on a 6-node cluster with sub-2ms queries.

How does Redis GEO handle driver availability (only match available drivers)?

Redis GEO does not natively support filtering: GEORADIUS returns all members within a radius, including busy and offline drivers. The application layer handles filtering: after GEORADIUS returns the nearest 20 candidates, MatchService checks each driver's availability status (stored in a separate Redis hash or in the driver object metadata). In practice, 60-70% of drivers are available at any time, so filtering 20 candidates to find 5 available drivers is efficient. An alternative used in some production designs is to maintain separate GEO sets per status (available_drivers, busy_drivers) to avoid post-filtering.

What happens when Kafka is down during ride creation?

The ride is still created in PostgreSQL (the critical path does not depend on Kafka), but the ride_matched event is not published. This means TrackingWorker cannot push the 'driver matched' notification to the rider via WebSocket, and PaymentWorker will not process the fare when the ride completes. The MatchService should buffer events locally and retry Kafka publishes. On recovery, Kafka consumers resume from the last committed offset, processing all buffered events. The V3 variant solves this definitively with the outbox pattern — events are written to PostgreSQL in the same transaction as the ride record, guaranteeing no event loss.

How does surge pricing work in this architecture?

MatchService tracks supply/demand ratio per geo cell. Available drivers per cell come from GEORADIUS counts on Redis GEO. Pending ride requests per cell come from recent ride_requested events in Kafka. When demand exceeds supply by 2x or more, a surge multiplier is applied to the fare. The multiplier is a config parameter on MatchService, cached with a 60-second TTL. This is a simplified approach — production systems use dedicated surge pricing services with more granular geo cells (H3 resolution 7, approximately 5 square km hexagons) and ML-based demand prediction.
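A minimal sketch of the threshold rule described above. The text does not give a formula, so several things here are assumptions: the function names, the default multiplier of 1.5, treating zero supply with pending demand as a surge condition, and the optional emergency cap, which mirrors the max_surge_multiplier_emergency setting from the what-if scenarios.

```python
def surge_multiplier(
    pending_requests: int,
    available_drivers: int,
    configured_multiplier: float = 1.5,    # Assumed value of the MatchService config parameter
    demand_ratio_threshold: float = 2.0,   # Surge when demand is at least 2x supply
    emergency: bool = False,
    max_surge_multiplier_emergency: float = 1.0,
) -> float:
    """Return the fare multiplier for one geo cell."""
    if available_drivers <= 0:
        ratio = float("inf") if pending_requests > 0 else 0.0
    else:
        ratio = pending_requests / available_drivers
    multiplier = configured_multiplier if ratio >= demand_ratio_threshold else 1.0
    if emergency:
        # Cap the multiplier inside a declared emergency zone
        multiplier = min(multiplier, max_surge_multiplier_emergency)
    return multiplier


def fare_cents(base_fare_cents: int, multiplier: float) -> int:
    """Apply the multiplier and keep the fare in whole cents."""
    return round(base_fare_cents * multiplier)


print(surge_multiplier(20, 10), fare_cents(1000, surge_multiplier(20, 10)))
```

A step function like this is easy to reason about but produces abrupt price changes at the threshold, which is one reason dedicated surge services tend to use smoother curves.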

Why is there no driver acceptance flow in this variant?

This design auto-dispatches the best driver — no acceptance prompt, no timeout cascade. Real production systems send a ride offer to the driver and wait 15-30 seconds for acceptance. On timeout, the offer cascades to the next nearest driver. This requires a state machine (OFFERED -> ACCEPTED/TIMEOUT -> next candidate) not modeled in the V1 variant. The V3 Global Resilient variant implements a full ride state machine with 8 states including MATCHING (offer sent, awaiting acceptance).
