[BEE-30082] Shadow Mode and Canary Deployment for ML Models
INFO
Shadow deployment duplicates live production traffic to a new model version without returning its responses to users. Canary deployment routes a small percentage of real users to the new version and compares outcomes before full rollout. Together they form a two-stage gate: shadow validates system behavior and prediction quality at zero user risk, canary validates product impact at controlled user risk. The key implementation constraint is that shadow infrastructure MUST observe but never mutate — it cannot write to production databases or trigger downstream side effects.
Context
ML model updates fail in production for reasons that are invisible to offline evaluation. A model that outperforms its predecessor on held-out data may produce higher latency under production concurrency, behave unexpectedly on real request distributions not present in the training set, or produce predictions that — while statistically better — worsen downstream business metrics due to system coupling effects.
Traditional software deployment strategies — blue-green cutover, feature flags — do not address the prediction quality dimension. They validate system correctness (does it start? does it return 200?) but not model correctness (does it predict accurately?). ML deployments require an additional layer: traffic mirroring and comparative analysis to validate model quality before any user is affected.
Uber's ML platform (Michelangelo) manages over 15 million real-time predictions per second at peak across 400+ use cases. Their deployment safety framework computes offline distributional statistics — percentiles, null rates, feature averages — at training time and uses them as the baseline against which production drift is measured during rollout (https://www.uber.com/blog/raising-the-bar-on-ml-model-deployment-safety/). LinkedIn's search team mirrors 10% of ranking queries to shadow candidates and evaluates NDCG@10 before any canary allocation.
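A minimal sketch of that baseline-and-compare pattern, assuming feature statistics are computed once at training time and checked against a window of production traffic during rollout (the tolerances and helper names are illustrative, not Uber's implementation):
import pandas as pd

def compute_feature_baseline(train_df: pd.DataFrame, features: list[str]) -> dict:
    """Capture training-time statistics that serve as the drift baseline."""
    return {
        f: {
            "p50": float(train_df[f].quantile(0.50)),
            "p99": float(train_df[f].quantile(0.99)),
            "mean": float(train_df[f].mean()),
            "null_rate": float(train_df[f].isna().mean()),
        }
        for f in features
    }

def check_feature_drift(
    prod_df: pd.DataFrame,
    baseline: dict,
    mean_tolerance: float = 0.10,       # relative shift in the mean
    null_rate_tolerance: float = 0.02,  # absolute increase in null rate
) -> list[str]:
    """Compare production feature statistics against the training baseline.
    Returns the features that breached a tolerance (candidates for blocking rollout)."""
    violations = []
    for feature, stats in baseline.items():
        prod_mean = float(prod_df[feature].mean())
        prod_null_rate = float(prod_df[feature].isna().mean())
        if stats["mean"] != 0 and abs(prod_mean - stats["mean"]) / abs(stats["mean"]) > mean_tolerance:
            violations.append(f"{feature}: mean {prod_mean:.4f} vs baseline {stats['mean']:.4f}")
        if prod_null_rate - stats["null_rate"] > null_rate_tolerance:
            violations.append(f"{feature}: null rate {prod_null_rate:.3f} vs baseline {stats['null_rate']:.3f}")
    return violations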
Architecture Choices
Four controlled deployment patterns exist for ML models:
| Pattern | User Impact | Latency Added | Validates | Best For |
|---|---|---|---|---|
| Shadow | None | <2ms (async mirror) | System + predictions | Latency, cache warming, prediction comparison |
| Canary | Limited (5–20%) | None | System + predictions + product | ML models, ranking changes |
| Blue-Green | None (cutover) | None | System only | Schema migrations, atomic releases |
| Interleaved | All users, paired | None | Predictions (ranking) | Search, recommendations |
Shadow is zero-risk from a user perspective. The mirrored request is fire-and-forget: the production model returns the response, and the shadow model's output is logged but discarded. Shadow is appropriate for: (a) validating that a new model version stays within the production latency budget at realistic concurrency, (b) warming caches before canary, and (c) accumulating predictions that can be joined with delayed ground truth labels to compute accuracy metrics.
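Before any ground truth arrives, the accumulated shadow log can also be compared directly against the production model's outputs for the same mirrored requests. A minimal sketch, assuming both services log a prediction keyed by a shared request ID (column names and the disagreement threshold are illustrative):
import pandas as pd

def compare_shadow_vs_production(
    prod_logs: pd.DataFrame,    # columns: request_id, prediction
    shadow_logs: pd.DataFrame,  # columns: request_id, prediction
    disagreement_threshold: float = 0.1,
) -> dict[str, float]:
    """Join production and shadow predictions on request_id and measure disagreement."""
    paired = prod_logs.merge(shadow_logs, on="request_id", suffixes=("_prod", "_shadow"))
    diff = (paired["prediction_prod"] - paired["prediction_shadow"]).abs()
    return {
        "n_paired": float(len(paired)),
        "mean_abs_diff": float(diff.mean()),
        "p99_abs_diff": float(diff.quantile(0.99)),
        "disagreement_rate": float((diff > disagreement_threshold).mean()),
    }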
Canary is the gate for product impact. After shadow validates system readiness, canary routes a percentage of real users to the new model and compares outcomes — conversion, click-through, error rate — against the production control group. A statistically significant regression against the control group triggers automatic rollback.
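A minimal sketch of the outcome comparison for a binary metric such as conversion, using a one-sided two-proportion z-test (the significance level and rollback rule are illustrative; see BEE-30034 for full experiment design):
from math import sqrt
from scipy.stats import norm

def canary_regression_check(
    control_conversions: int, control_total: int,
    canary_conversions: int, canary_total: int,
    alpha: float = 0.05,
) -> dict:
    """One-sided two-proportion z-test: is the canary conversion rate
    significantly worse than control? If so, signal rollback."""
    p_control = control_conversions / control_total
    p_canary = canary_conversions / canary_total
    p_pooled = (control_conversions + canary_conversions) / (control_total + canary_total)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / control_total + 1 / canary_total))
    z = (p_canary - p_control) / se
    p_value = float(norm.cdf(z))  # left tail: probability canary is this much worse by chance
    return {
        "control_rate": p_control,
        "canary_rate": p_canary,
        "z": z,
        "p_value": p_value,
        "rollback": p_value < alpha,
    }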
Shadow Deployment with Istio
Istio implements shadow mode via the mirror field in a VirtualService. Mirrored requests arrive at the shadow service with -shadow appended to the Host/Authority header, so shadow traffic can be distinguished from production traffic in logs. The production response path is never blocked — the mirror is asynchronous.
# shadow-virtualservice.yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
name: recommendation-model
namespace: ml-serving
spec:
hosts:
- recommendation-model
http:
- route:
- destination:
host: recommendation-model
subset: v1 # production model — 100% of responses to users
weight: 100
mirror:
host: recommendation-model
subset: v2 # shadow candidate — responses discarded
mirrorPercentage:
value: 20.0 # mirror 20% of production traffic
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
name: recommendation-model
namespace: ml-serving
spec:
host: recommendation-model
subsets:
- name: v1
labels:
version: "2024-q4" # current production
- name: v2
labels:
version: "2025-q1" # shadow candidateThe shadow service MUST be deployed with write-path isolation: no writes to shared databases, no Kafka produce calls, no email sends. A common pattern is to inject an environment variable SHADOW_MODE=true into the shadow deployment and guard all side-effect paths:
import os
import logging
SHADOW_MODE = os.getenv("SHADOW_MODE", "false").lower() == "true"
def record_prediction(user_id: str, prediction: float, model_version: str) -> None:
"""Write prediction to audit log. Skipped in shadow mode."""
if SHADOW_MODE:
logging.info(
"shadow_prediction user_id=%s prediction=%.4f version=%s",
user_id, prediction, model_version,
)
return # do not write to production DB
db.execute(
"INSERT INTO predictions (user_id, prediction, model_version, ts) VALUES (?, ?, ?, NOW())",
(user_id, prediction, model_version),
    )
Offline Label Joining for Shadow Evaluation
Shadow predictions must be joined with delayed ground truth labels to compute accuracy metrics. Ground truth often arrives hours after the prediction (e.g., a purchase event after a recommendation, a trip completion after an ETA prediction). A label joiner runs as a batch job:
import pandas as pd
from datetime import datetime, timedelta
def join_shadow_predictions_with_labels(
shadow_logs: pd.DataFrame, # columns: request_id, user_id, prediction, ts
ground_truth: pd.DataFrame, # columns: user_id, label, event_ts
max_label_delay_hours: int = 24,
) -> pd.DataFrame:
"""
For each shadow prediction, find the ground truth label that arrived
within max_label_delay_hours after the prediction timestamp.
Uses as-of join to avoid label leakage.
"""
shadow_logs = shadow_logs.sort_values("ts")
ground_truth = ground_truth.sort_values("event_ts")
# Merge on user_id, take the first label event after the prediction
joined = pd.merge_asof(
shadow_logs,
ground_truth,
left_on="ts",
right_on="event_ts",
by="user_id",
direction="forward", # label must arrive AFTER prediction
tolerance=pd.Timedelta(hours=max_label_delay_hours),
)
# Drop rows where no label arrived within the window
joined = joined.dropna(subset=["label"])
return joined
def compute_shadow_metrics(joined: pd.DataFrame, n_shadow_predictions: int) -> dict[str, float]:
    """Compute accuracy metrics over labeled shadow predictions.
    n_shadow_predictions is the total number of shadow predictions logged before the label join."""
    from sklearn.metrics import roc_auc_score, average_precision_score
    return {
        "auc": roc_auc_score(joined["label"], joined["prediction"]),
        "avg_precision": average_precision_score(joined["label"], joined["prediction"]),
        "n_predictions": len(joined),
        "label_join_rate": len(joined) / n_shadow_predictions,  # fraction that got a label
    }
Canary Rollout with KServe
KServe's canaryTrafficPercent field splits traffic between the last stable version and the current spec. Setting the field to 10 sends 10% of requests to the new model and 90% to the last version that received 100% traffic. Promotion removes the field; rollback sets it to 0.
# kserve-canary.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "fraud-classifier"
namespace: ml-production
annotations:
serving.kserve.io/enable-tag-routing: "true"
spec:
predictor:
canaryTrafficPercent: 10 # 10% canary, 90% to last stable
minReplicas: 2
maxReplicas: 8
model:
modelFormat:
name: sklearn
storageUri: "gs://ml-models/fraud/v2.1.0"
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "4Gi"# Promote: 100% to new version
kubectl patch isvc fraud-classifier -n ml-production \
--type='json' \
-p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
# Rollback: drain new version immediately
kubectl patch isvc fraud-classifier -n ml-production \
--type='json' \
-p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":0}]'
# Tag-based testing: call canary directly without touching the traffic split
curl -H "Host: latest-fraud-classifier-predictor-default.ml-production.example.com" \
-H "Content-Type: application/json" \
http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/fraud-classifier:predict \
  -d @test-payload.json
Automated Rollback with Argo Rollouts
Argo Rollouts runs AnalysisTemplate resources at each canary step. When an analysis metric fails failureLimit times, Argo Rollouts initiates an automatic rollback to the previous stable version.
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: ml-model-quality-gate
namespace: ml-production
spec:
args:
- name: service-name
metrics:
- name: prediction-success-rate
interval: 60s
successCondition: result[0] >= 0.95
failureLimit: 3 # rollback after 3 consecutive failures
provider:
prometheus:
address: http://prometheus.monitoring.svc:9090
query: |
sum(rate(model_predictions_total{
service="{{args.service-name}}",
result="correct"
}[5m])) /
sum(rate(model_predictions_total{
service="{{args.service-name}}"
}[5m]))
- name: p99-latency-ms
interval: 60s
thresholdRange:
max: 200 # rollback if p99 latency exceeds 200ms
provider:
prometheus:
address: http://prometheus.monitoring.svc:9090
query: |
histogram_quantile(0.99,
sum(rate(model_inference_duration_seconds_bucket{
service="{{args.service-name}}"
}[5m])) by (le)
) * 1000
---
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: churn-predictor
namespace: ml-production
spec:
replicas: 4
selector:
matchLabels:
app: churn-predictor
template:
metadata:
labels:
app: churn-predictor
spec:
containers:
- name: model-server
image: ml-registry/churn-predictor:v3.2.0
ports:
- containerPort: 8080
strategy:
canary:
steps:
- setWeight: 10
- pause: {duration: 5m} # bake time at 10%
- analysis:
templates:
- templateName: ml-model-quality-gate
args:
- name: service-name
value: churn-predictor
- setWeight: 25
- pause: {duration: 10m}
- setWeight: 50
- pause: {duration: 10m}
      - setWeight: 100
MLflow Champion/Challenger Aliases
As of MLflow 2.9.0, model stages (Staging/Production) are deprecated in favor of mutable aliases. Aliases support arbitrary names — champion, challenger, canary, shadow — and multiple aliases can point to the same version.
import mlflow
from mlflow import MlflowClient
MODEL_NAME = "fraud-classifier"
client = MlflowClient()
def register_challenger(run_id: str, metrics: dict) -> int:
"""Register a new model version as challenger."""
model_uri = f"runs:/{run_id}/model"
mv = mlflow.register_model(model_uri, MODEL_NAME)
version = int(mv.version)
# Tag with evaluation metrics for audit trail
client.set_model_version_tag(MODEL_NAME, str(version), "auc", str(metrics["auc"]))
client.set_model_version_tag(MODEL_NAME, str(version), "eval_date", metrics["date"])
# Assign challenger alias — routing code loads by alias
client.set_registered_model_alias(MODEL_NAME, "challenger", version)
return version
def promote_challenger_to_champion() -> None:
"""Atomically swap challenger → champion."""
challenger_mv = client.get_model_version_by_alias(MODEL_NAME, "challenger")
champion_mv = client.get_model_version_by_alias(MODEL_NAME, "champion")
# Promote
client.set_registered_model_alias(MODEL_NAME, "champion", int(challenger_mv.version))
# Retain rollback alias on old champion
client.set_registered_model_alias(MODEL_NAME, "rollback", int(champion_mv.version))
client.delete_registered_model_alias(MODEL_NAME, "challenger")
# Load by alias in serving code — alias resolves to the correct version
champion = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
challenger = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@challenger")
Common Mistakes
Writing to production state from the shadow path. If the shadow service sends emails, updates a recommendation cache, or records an impression, users receive duplicate effects. Every write path in the shadow deployment must be disabled or mocked. The SHADOW_MODE guard at the service level is not sufficient if downstream services are shared — a separate deployment with isolated dependencies is safer.
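One hedged way to mock rather than guard is to choose the client implementation at construction time, so a missed guard in a handler cannot reach shared infrastructure. A minimal sketch, assuming a kafka-python producer on the production path (the EventProducer protocol and class names are illustrative):
import logging
import os
from typing import Protocol

class EventProducer(Protocol):
    def send(self, topic: str, payload: bytes) -> None: ...

class NoOpProducer:
    """Shadow-mode stand-in: logs the event instead of producing it."""
    def send(self, topic: str, payload: bytes) -> None:
        logging.info("shadow_drop topic=%s bytes=%d", topic, len(payload))

class KafkaEventProducer:
    """Production path: thin adapter over kafka-python's KafkaProducer."""
    def __init__(self, brokers: str) -> None:
        from kafka import KafkaProducer
        self._producer = KafkaProducer(bootstrap_servers=brokers)
    def send(self, topic: str, payload: bytes) -> None:
        self._producer.send(topic, payload)

def build_producer() -> EventProducer:
    if os.getenv("SHADOW_MODE", "false").lower() == "true":
        return NoOpProducer()
    return KafkaEventProducer(os.environ["KAFKA_BROKERS"])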
Running shadow without latency budget validation. The purpose of shadow is not only to compare predictions but to confirm the new model version stays within the production latency budget under realistic concurrency. Run shadow at a traffic percentage that stresses the shadow pod pool — 100% shadow is acceptable if the pods are isolated — before proceeding to canary.
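A minimal sketch of that check over logged inference latencies, assuming production and shadow are measured over the same window in milliseconds (the 200 ms budget mirrors the p99 threshold in the AnalysisTemplate above; the 10% regression tolerance is illustrative):
import numpy as np

def latency_budget_check(
    shadow_latencies_ms: np.ndarray,
    prod_latencies_ms: np.ndarray,
    p99_budget_ms: float = 200.0,
    max_p99_regression_pct: float = 10.0,
) -> dict:
    """Compare shadow p99 latency against an absolute budget and against
    production p99 measured over the same window."""
    shadow_p99 = float(np.percentile(shadow_latencies_ms, 99))
    prod_p99 = float(np.percentile(prod_latencies_ms, 99))
    regression_pct = (shadow_p99 - prod_p99) / prod_p99 * 100
    return {
        "shadow_p99_ms": shadow_p99,
        "prod_p99_ms": prod_p99,
        "p99_regression_pct": regression_pct,
        "within_budget": shadow_p99 <= p99_budget_ms and regression_pct <= max_p99_regression_pct,
    }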
Setting canary windows too short. Diurnal traffic patterns mean that a two-hour canary window may capture only daytime traffic. Prediction quality for a recommendation model may degrade specifically for overnight or weekend sessions. Canary windows SHOULD be at least 24 hours for consumer-facing services.
Using the same rollback thresholds for all models. A fraud model with 0.001% false positive rate needs tighter rollback thresholds than a recommendation model measuring click-through. Define thresholds per-model class based on the business cost of the error type.
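A minimal sketch of per-model-class gates as declarative config that the analysis step can look up (the classes and numbers are illustrative, not recommended values):
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackThresholds:
    max_p99_latency_ms: float
    min_success_rate: float
    max_false_positive_rate: float | None = None  # only meaningful for classifiers

# Illustrative per-class gates; tune to the business cost of each error type.
THRESHOLDS_BY_MODEL_CLASS = {
    "fraud": RollbackThresholds(max_p99_latency_ms=100, min_success_rate=0.999,
                                max_false_positive_rate=0.0001),
    "recommendation": RollbackThresholds(max_p99_latency_ms=200, min_success_rate=0.95),
    "forecasting": RollbackThresholds(max_p99_latency_ms=500, min_success_rate=0.97),
}

def thresholds_for(model_class: str) -> RollbackThresholds:
    return THRESHOLDS_BY_MODEL_CLASS[model_class]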
Forgetting to drain connections before rollback. An abrupt rollback mid-stream leaves in-flight requests with no response. Use a pre-stop lifecycle hook in Kubernetes to drain existing connections before the pod terminates:
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 10"] # wait for in-flight requestsRelated BEEs
- BEE-30034 AI Experimentation and Model A/B Testing — statistical experiment design and significance testing for ML
- BEE-30009 LLM Observability and Monitoring — metrics collection for model serving
- BEE-16002 Deployment Strategies — blue-green, rolling, and canary patterns for general software
- BEE-30081 AI Feature Stores for ML Inference — feature infrastructure underlying model serving
- BEE-12007 Rate Limiting and Throttling — traffic control at the serving layer
References
- Istio, "Mirroring," traffic management documentation. https://istio.io/latest/docs/tasks/traffic-management/mirroring/
- KServe, "Canary Rollout," InferenceService documentation. https://kserve.github.io/website/docs/model-serving/predictive-inference/rollout-strategies/canary-example
- Argo Rollouts canary analysis with Prometheus. https://www.infracloud.io/blogs/progressive-delivery-argo-rollouts-canary-analysis/
- Flagger, "Metrics," progressive delivery documentation. https://docs.flagger.app/usage/metrics
- AWS, "Minimize the production impact of ML model updates with Amazon SageMaker shadow testing," Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/minimize-the-production-impact-of-ml-model-updates-with-amazon-sagemaker-shadow-testing/
- Amazon SageMaker, "Shadow variants," developer guide. https://docs.aws.amazon.com/sagemaker/latest/dg/model-shadow-deployment.html
- Uber Engineering, "Raising the Bar on ML Model Deployment Safety." https://www.uber.com/blog/raising-the-bar-on-ml-model-deployment-safety/
- Christopher Samiullah, "Deploying Machine Learning Applications in Shadow Mode," 2019. https://christophergs.com/machine%20learning/2019/03/30/deploying-machine-learning-applications-in-shadow-mode/
- MLflow, "Model Registry," documentation. https://mlflow.org/docs/latest/model-registry/
- Seldon Core, "Ambassador Shadow Deployments." https://docs.seldon.io/projects/seldon-core/en/latest/examples/ambassador_shadow.html