ML Model Registry

An ML model registry is a centralized store for versioned model artifacts, metadata, and lifecycle state. It is the 'source of truth' that connects training pipelines to serving infrastructure, enabling reproducibility, auditability, and governance across the model lifecycle from experiment to production to retirement.

Overview

A model registry is to ML what a container registry is to software: the canonical store from which production systems pull artifacts. Without a registry, models are ad-hoc files on S3, shared via Slack, with no versioning, no lineage, and no rollback capability. When a production model misbehaves, the questions 'which version is running?', 'what data was it trained on?', and 'what was the previous version?' become impossible to answer.

The core abstraction in a model registry is the registered model, which is a named entity (e.g., 'search-ranking-v2') with multiple versions. Each version is an immutable artifact bundle: the serialized model weights, the preprocessing code or pipeline, a model signature (input/output schema), and metadata including training data snapshot ID, hyperparameters, evaluation metrics, the code commit hash, and the training pipeline run ID. Immutability is critical -- you must never overwrite a version, only create new ones.
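A minimal sketch of this abstraction in Python, assuming an in-memory store. The names `ModelVersion` and `InMemoryRegistry` and all field names are hypothetical, not the API of any particular registry product. It illustrates two properties from the paragraph above: a version is frozen once registered, and registering again always creates a new version number rather than overwriting an old one.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class ModelVersion:
    """One immutable version of a registered model (hypothetical schema)."""
    name: str
    version: int
    artifact_uri: str               # where the serialized weights live
    signature: Mapping[str, str]    # input/output schema
    metadata: Mapping[str, str]     # data snapshot ID, commit hash, metrics, ...


class InMemoryRegistry:
    """Toy registry: versions can be added and read, never replaced."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], ModelVersion] = {}

    def register(self, name, artifact_uri, signature, metadata) -> ModelVersion:
        # The next version number is always one past the highest existing one.
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        model_version = ModelVersion(
            name=name,
            version=latest + 1,
            artifact_uri=artifact_uri,
            # Read-only views over private copies, so callers cannot mutate them.
            signature=MappingProxyType(dict(signature)),
            metadata=MappingProxyType(dict(metadata)),
        )
        self._versions[(name, model_version.version)] = model_version
        return model_version

    def get(self, name, version) -> ModelVersion:
        return self._versions[(name, version)]
```

Attempting `some_version.artifact_uri = "..."` raises an error because the dataclass is frozen; the only way to change a model is to call `register` again and get a new version.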

Lifecycle management tracks each version through stages: 'Experimental' (created during development), 'Staging' (passed automated quality gates, awaiting human review), 'Production' (actively serving traffic), and 'Archived' (retired, retained for audit). Promotion between stages can be automated (if evaluation metrics exceed thresholds) or require manual approval (for high-risk models like credit scoring). This mirrors the software release process but adds ML-specific gates: accuracy, calibration, fairness metrics, and data lineage checks.
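The stage model above can be sketched as a small state machine. This is an illustration under stated assumptions, not a definitive implementation: the function name `can_promote`, the `high_risk` flag, and the rule that every metric is "higher is better" are all hypothetical choices made for the example.

```python
from enum import Enum


class Stage(Enum):
    EXPERIMENTAL = "Experimental"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Allowed transitions; anything not listed here is rejected.
ALLOWED = {
    Stage.EXPERIMENTAL: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


def can_promote(current, target, metrics, thresholds,
                high_risk=False, approved_by=None):
    """Return (ok, reason) for a requested stage change.

    Assumption: every metric in `thresholds` is higher-is-better and must
    meet or exceed its minimum value.
    """
    if target not in ALLOWED[current]:
        return False, f"transition {current.value} -> {target.value} not allowed"
    if target is Stage.ARCHIVED:
        return True, "ok"  # retiring a version needs no quality gate
    failed = [name for name, minimum in thresholds.items()
              if metrics.get(name, float("-inf")) < minimum]
    if failed:
        return False, "metric gate failed: " + ", ".join(sorted(failed))
    # Human gate: only enforced for high-risk models entering Production.
    if target is Stage.PRODUCTION and high_risk and not approved_by:
        return False, "manual approval required"
    return True, "ok"
```

For a low-risk model, `can_promote(Stage.STAGING, Stage.PRODUCTION, metrics, thresholds)` passes on metrics alone; setting `high_risk=True` makes the same call fail until `approved_by` is supplied.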

Lineage tracking is the model registry's most valuable feature for regulated industries. A complete lineage record traces from a production prediction back to: the model version, the training run, the training data snapshot, the feature transformations, and the code commit. In financial services, healthcare, and other regulated domains, this lineage is required for audit and compliance. Even in unregulated domains, lineage is essential for debugging: 'this model's accuracy degraded because training data version 47 contained a schema change that corrupted the user_age feature.'
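As a rough illustration of the lineage chain described above, the sketch below assumes two lookup tables: a prediction log that records which model version served each prediction, and a lineage table keyed by model version. The record fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Everything needed to reproduce or audit one model version."""
    model_version: str
    training_run_id: str
    data_snapshot_id: str
    feature_transform_version: str
    code_commit: str


def trace_prediction(prediction_id, prediction_log, lineage_by_version):
    """Walk from a logged prediction back to its full lineage record."""
    model_version = prediction_log[prediction_id]   # which version served it
    return lineage_by_version[model_version]        # how that version was built


def versions_trained_on(snapshot_id, lineage_by_version):
    """Reverse lookup: which model versions used a given data snapshot?"""
    return sorted(version for version, record in lineage_by_version.items()
                  if record.data_snapshot_id == snapshot_id)
```

The reverse lookup is what makes the debugging scenario practical: once a data snapshot is known to be corrupted, `versions_trained_on` lists every model version that needs to be re-evaluated or rolled back.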

Key Points
  • 1. Every model version is an immutable artifact bundle: weights + preprocessing code + model signature + metadata (training data ID, hyperparameters, metrics, code commit). Never overwrite a version; create a new one.
  • 2. Lifecycle stages (Experimental -> Staging -> Production -> Archived) with promotion gates ensure that only validated models reach production. Automated gates check metrics; human gates add review for high-risk models.
  • 3. Model lineage traces from a production prediction to the model version, training run, data snapshot, feature definitions, and code commit. This is essential for debugging, reproducibility, and regulatory compliance.
  • 4. Model signatures define input and output schemas (feature names, types, shapes). The serving infrastructure validates requests against the signature, catching schema mismatches before they cause silent prediction errors.
  • 5. Rollback requires the previous model version to remain loaded or quickly loadable. Best practice is to keep the N-1 version warm in the serving fleet so rollback is a traffic routing change (seconds), not a model load (minutes).
  • 6. The registry integrates with CI/CD: training pipelines register models, evaluation pipelines promote models through stages, and serving infrastructure pulls the latest 'Production' version. This automation eliminates manual deployment steps.
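The signature check from point 4 can be sketched as follows. This is a simplified example that assumes a signature is a mapping from feature name to Python type; real signatures typically also cover shapes and outputs. The function name `validate_request` is hypothetical.

```python
def validate_request(signature, request):
    """Return a list of schema errors; an empty list means the request is valid.

    `signature` maps feature name -> expected Python type (an assumption
    made for this sketch).
    """
    errors = []
    for name, expected in signature.items():
        if name not in request:
            errors.append(f"missing feature: {name}")
            continue
        value = request[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # unless the signature really asks for a bool.
        wrong_type = (not isinstance(value, expected)
                      or (isinstance(value, bool) and expected is not bool))
        if wrong_type:
            errors.append(
                f"wrong type for {name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    for name in request:
        if name not in signature:
            errors.append(f"unexpected feature: {name}")
    return errors
```

A serving layer would reject the request with a client error whenever the returned list is non-empty, instead of passing malformed input to the model.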
Simple Example

Deploying a New Recommendation Model

A data scientist trains a new recommendation model and registers it as 'reco-model' version 14 in the model registry. The registration includes the model weights (2.3 GB), the feature preprocessing pipeline (Python pickle), evaluation metrics (NDCG@10 = 0.42, up from 0.40 in version 13), and metadata (trained on data snapshot 2026-06-01, code commit abc123). An automated pipeline promotes version 14 to 'Staging' because NDCG improved. A human reviewer approves promotion to 'Production' after checking fairness metrics. The serving infrastructure detects the new Production version, loads it into a canary replica, and gradually shifts traffic from version 13 to version 14 over 2 hours while monitoring prediction quality.
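The gradual traffic shift in this example could follow many schedules; the text does not say which one is used. The sketch below assumes a linear ramp over the 2-hour window, starting from a 5% canary share, and aborts to 0% if monitoring reports a quality problem. The function names, the starting share, and the linear shape are all assumptions for illustration.

```python
def canary_fraction(elapsed_minutes, ramp_minutes=120.0, start_fraction=0.05):
    """Share of traffic sent to the new version after `elapsed_minutes`.

    Linear ramp from `start_fraction` at t=0 to 1.0 at t=ramp_minutes,
    clamped outside that window.
    """
    progress = min(max(elapsed_minutes / ramp_minutes, 0.0), 1.0)
    return start_fraction + (1.0 - start_fraction) * progress


def route_fraction(elapsed_minutes, quality_ok):
    """Send no traffic to the new version once monitoring flags a problem."""
    return canary_fraction(elapsed_minutes) if quality_ok else 0.0
```

Halfway through the window the new version receives a little over half of the traffic, and a failed quality check at any point returns all traffic to the previous version.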

Real-World Examples

Uber (Michelangelo)

Uber's Michelangelo model registry stores thousands of model versions across hundreds of models for ride pricing, ETA estimation, fraud detection, and driver matching. Every model version includes a pointer to the training data snapshot, the feature store version, and the evaluation report. Models require automated evaluation and team lead approval before promotion to Production. Rollback to the previous version takes under 60 seconds via traffic routing.

Netflix

Netflix's model registry integrates with Metaflow (training), Meson (scheduling), and their A/B testing platform. Each model version records not just metrics but also the A/B test results that validated it. Models go through a staged rollout: shadow mode (predictions logged but not shown), canary (5% traffic), and full rollout. The registry retains the full history, enabling Netflix to analyze how model quality has evolved over years.

Google (Vertex AI Model Registry)

Vertex AI Model Registry supports model versioning, deployment to endpoints with traffic splitting, and integration with Vertex AI Pipelines for automated training-to-deployment workflows. It stores model artifacts in Google Cloud Storage, supports custom metadata labels for governance, and provides an API for querying model lineage. Google uses a similar internal registry for serving models across Search, Ads, and YouTube.

Trade-Offs
Aspect | Description
Strict Governance vs. Iteration Speed | Requiring manual approval and extensive evaluation for every model version increases safety but slows deployment. Fast-moving teams (recommendations, personalization) may need automated promotion with only metric gates. High-risk models (fraud, credit, medical) require human review and documentation. A tiered governance model matches rigor to risk.
Centralized vs. Federated Registry | A single organization-wide registry enables cross-team model discovery and consistent governance but can become a bottleneck with complex access controls. Per-team registries are simpler but prevent reuse and create governance blind spots. Most organizations use a centralized registry with team-level namespaces.
Artifact Size vs. Reproducibility | Storing full model artifacts (7-140 GB for LLMs) in the registry is expensive but enables instant deployment. Storing only metadata + a pointer to external storage (S3, GCS) is cheaper but adds dependency on the external store's availability and consistency.
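For the metadata-plus-pointer option, one common way to reduce the consistency risk is to record a content digest next to the pointer and verify it when the artifact is loaded. The text does not prescribe this; the sketch below is one possible mitigation, and `ArtifactPointer`, `make_pointer`, and `verify` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactPointer:
    """What the registry stores instead of the artifact bytes themselves."""
    uri: str          # location in external storage, e.g. an S3 or GCS path
    sha256: str       # hex digest of the artifact at registration time
    size_bytes: int


def make_pointer(uri, payload):
    """Build a pointer for `payload` (the artifact bytes) stored at `uri`."""
    return ArtifactPointer(uri, hashlib.sha256(payload).hexdigest(), len(payload))


def verify(pointer, payload):
    """True only if the fetched bytes match what was registered."""
    return (len(payload) == pointer.size_bytes
            and hashlib.sha256(payload).hexdigest() == pointer.sha256)
```

If the object in external storage is overwritten or truncated after registration, `verify` returns `False` and the serving layer can refuse to load it. For multi-gigabyte artifacts the digest would be computed in streamed chunks rather than from a single in-memory `bytes` object.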
Case Study

Airbnb's ML Model Governance Platform

Scenario

Airbnb had 100+ ML models in production with no centralized tracking. A pricing model update caused a revenue drop of $2M before the team realized the new model had been trained on a data snapshot with a known data quality issue. There was no way to quickly identify which model version was running, what data it was trained on, or how to roll back.

Solution

Airbnb built a model governance platform centered on a registry that tracked every model version's lineage (data snapshot, feature store version, code commit, evaluation metrics). Promotion to Production required passing automated quality gates (accuracy, calibration, fairness) and manager approval. The serving layer maintained the previous version warm for instant rollback.
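The "previous version kept warm" idea can be sketched as a router that holds the current and previous versions in memory, so rollback is a pointer swap rather than a model load. This is an illustrative toy, not a description of Airbnb's system; `WarmRollbackRouter` and its methods are hypothetical.

```python
from typing import Callable, Dict, Optional


class WarmRollbackRouter:
    """Keeps version N (active) and N-1 (warm) loaded; evicts anything else."""

    def __init__(self) -> None:
        self._loaded: Dict[int, Callable] = {}
        self.active: Optional[int] = None
        self.previous: Optional[int] = None

    def _evict_stale(self) -> None:
        for version in list(self._loaded):
            if version not in (self.active, self.previous):
                del self._loaded[version]

    def deploy(self, version, predict_fn) -> None:
        """Load a new version and make it active; the old active stays warm."""
        self._loaded[version] = predict_fn
        if self.active is not None:
            self.previous = self.active
        self.active = version
        self._evict_stale()

    def rollback(self) -> None:
        """Switch traffic back to the warm version without loading anything."""
        if self.previous is None:
            raise RuntimeError("no warm previous version to roll back to")
        self.active, self.previous = self.previous, None
        self._evict_stale()  # drop the version that was rolled back

    def predict(self, request):
        return self._loaded[self.active](request)
```

Because `rollback` only reassigns `active`, its cost is independent of model size, which is the property that turns a redeployment measured in hours into a routing change measured in seconds.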

Outcome

Model-related incidents decreased 70%. Mean time to rollback dropped from 2 hours (manual redeployment) to 90 seconds (traffic routing switch). The data quality issue that caused the $2M loss would have been caught by the automated data snapshot validation gate. Cross-team model reuse increased because data scientists could discover and understand existing models through the registry's catalog.

Common Mistakes
  • ⚠ No model versioning -- overwriting the production model in place. When the new version has a bug, there is nothing to roll back to. The 'fix' is to retrain, which takes hours. Always version models immutably and keep the previous production version available for instant rollback.
  • ⚠ Storing model weights without metadata. A model file on S3 without training data version, hyperparameters, and evaluation metrics is undebuggable. Six months later, no one knows what data or code produced it. Require metadata as a mandatory part of model registration.
  • ⚠ No model signature validation. Without an input schema definition, a serving system may silently accept requests with missing features, wrong types, or swapped column order, producing garbage predictions with 200 OK responses. Define and enforce model signatures.
  • ⚠ Treating model governance as optional. In regulated industries, deploying a model without lineage, evaluation records, and approval workflows can result in regulatory penalties. Even in unregulated domains, ungoverned models cause costly silent failures.

Test Your Understanding

1. What is the primary purpose of model lineage tracking in a model registry?

2. Why should model versions in a registry be immutable?
