
[BEE-30082] Shadow Mode and Canary Deployment for ML Models

INFO

Shadow deployment duplicates live production traffic to a new model version without returning its responses to users. Canary deployment routes a small percentage of real users to the new version and compares outcomes before full rollout. Together they form a two-stage gate: shadow validates system behavior and prediction quality at zero user risk, canary validates product impact at controlled user risk. The key implementation constraint is that shadow infrastructure MUST observe but never mutate — it cannot write to production databases or trigger downstream side effects.

Context

ML model updates fail in production for reasons that are invisible to offline evaluation. A model that outperforms its predecessor on held-out data may produce higher latency under production concurrency, behave unexpectedly on real request distributions not present in the training set, or produce predictions that — while statistically better — worsen downstream business metrics due to system coupling effects.

Traditional software deployment strategies — blue-green cutover, feature flags — do not address the prediction quality dimension. They validate system correctness (does it start? does it return 200?) but not model correctness (does it predict accurately?). ML deployments require an additional layer: traffic mirroring and comparative analysis to validate model quality before any user is affected.

Uber's ML platform (Michelangelo) manages over 15 million real-time predictions per second at peak across 400+ use cases. Their deployment safety framework computes offline distributional statistics — percentiles, null rates, feature averages — at training time and uses them as the baseline against which production drift is measured during rollout (https://www.uber.com/blog/raising-the-bar-on-ml-model-deployment-safety/). LinkedIn's search team mirrors 10% of ranking queries to shadow candidates and evaluates NDCG@10 before any canary allocation.
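
As a minimal sketch of that baseline-statistics idea (the column handling and tolerance below are illustrative, not Uber's implementation): compute per-feature statistics once at training time, then flag production batches that drift from them.

python
import pandas as pd


def compute_baseline_stats(train_df: pd.DataFrame) -> dict[str, dict[str, float]]:
    """Capture per-feature statistics at training time as the drift baseline."""
    stats: dict[str, dict[str, float]] = {}
    for col in train_df.select_dtypes(include="number").columns:
        s = train_df[col]
        stats[col] = {
            "mean": float(s.mean()),
            "p50": float(s.quantile(0.50)),
            "p99": float(s.quantile(0.99)),
            "null_rate": float(s.isna().mean()),
        }
    return stats


def drifted_features(
    baseline: dict[str, dict[str, float]],
    prod_df: pd.DataFrame,
    rel_tolerance: float = 0.10,
) -> list[str]:
    """Flag features whose production mean deviates more than rel_tolerance
    (relative) from the training-time mean."""
    flagged = []
    for col, ref in baseline.items():
        if col not in prod_df.columns:
            continue
        prod_mean = float(prod_df[col].mean())
        denom = abs(ref["mean"]) or 1e-9  # avoid division by zero
        if abs(prod_mean - ref["mean"]) / denom > rel_tolerance:
            flagged.append(col)
    return flagged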

Architecture Choices

Four controlled deployment patterns exist for ML models:

| Pattern | User Impact | Latency Added | Validates | Best For |
|---|---|---|---|---|
| Shadow | None | <2ms (async mirror) | System + predictions | Latency, cache warming, prediction comparison |
| Canary | Limited (5–20%) | None | System + predictions + product | ML models, ranking changes |
| Blue-Green | None (cutover) | None | System only | Schema migrations, atomic releases |
| Interleaved | All users, paired | None | Predictions (ranking) | Search, recommendations |

Shadow is zero-risk from a user perspective. The mirrored request is fire-and-forget: the production model returns the response, and the shadow model's output is logged but discarded. Shadow is appropriate for: (a) validating that a new model version stays within the production latency budget at realistic concurrency, (b) warming caches before canary, and (c) accumulating predictions that can be joined with delayed ground truth labels to compute accuracy metrics.
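
The Istio configuration in the next section implements the mirror at the mesh layer. Where no mesh is available, the same fire-and-forget pattern can live in application code. A minimal sketch using httpx, with hypothetical endpoints:

python
import asyncio
import logging

import httpx

PROD_URL = "http://recommendation-model-v1/predict"    # serves the user
SHADOW_URL = "http://recommendation-model-v2/predict"  # response is discarded


async def predict(client: httpx.AsyncClient, payload: dict) -> dict:
    # Schedule the mirror first and never await it on the request path.
    # (Real code should hold a reference to the task so it is not GC'd.)
    asyncio.create_task(_mirror(client, payload))
    resp = await client.post(PROD_URL, json=payload, timeout=0.2)
    return resp.json()


async def _mirror(client: httpx.AsyncClient, payload: dict) -> None:
    try:
        await client.post(SHADOW_URL, json=payload, timeout=1.0)
    except httpx.HTTPError:
        logging.debug("shadow mirror failed; production path unaffected")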

Canary is the gate for product impact. After shadow validates system readiness, canary routes a percentage of real users to the new model and compares outcomes — conversion, click-through, error rate — against the production control group. A statistically significant regression triggers automatic rollback.
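
The outcome comparison is a standard two-proportion test. A minimal sketch for a conversion metric, assuming per-arm counts are collected elsewhere:

python
from math import sqrt
from statistics import NormalDist


def canary_regresses(
    control_conversions: int, control_n: int,
    canary_conversions: int, canary_n: int,
    alpha: float = 0.05,
) -> bool:
    """One-sided two-proportion z-test: True if the canary conversion rate
    is significantly lower than control's (i.e., roll back)."""
    p_control = control_conversions / control_n
    p_canary = canary_conversions / canary_n
    pooled = (control_conversions + canary_conversions) / (control_n + canary_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / canary_n))
    if se == 0:
        return False
    z = (p_canary - p_control) / se
    return NormalDist().cdf(z) < alpha  # significantly worse at level alpha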

Shadow Deployment with Istio

Istio implements shadow mode via the mirror field in VirtualService. Mirrored requests arrive at the shadow service with -shadow appended to the Host/Authority header, which lets shadow traffic be distinguished from production traffic in logs. The production response path is never blocked — the mirror is asynchronous.

yaml
# shadow-virtualservice.yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: recommendation-model
  namespace: ml-serving
spec:
  hosts:
  - recommendation-model
  http:
  - route:
    - destination:
        host: recommendation-model
        subset: v1          # production model — 100% of responses to users
      weight: 100
    mirror:
      host: recommendation-model
      subset: v2            # shadow candidate — responses discarded
    mirrorPercentage:
      value: 20.0           # mirror 20% of production traffic
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: recommendation-model
  namespace: ml-serving
spec:
  host: recommendation-model
  subsets:
  - name: v1
    labels:
      version: "2024-q4"   # current production
  - name: v2
    labels:
      version: "2025-q1"   # shadow candidate

The shadow service MUST be deployed with write-path isolation: no writes to shared databases, no Kafka produce calls, no email sends. A common pattern is to inject an environment variable SHADOW_MODE=true into the shadow deployment and guard all side-effect paths:

python
import os
import logging

SHADOW_MODE = os.getenv("SHADOW_MODE", "false").lower() == "true"

# `db` is assumed to be the service's production database client,
# initialized elsewhere in the application.

def record_prediction(user_id: str, prediction: float, model_version: str) -> None:
    """Write prediction to audit log. Skipped in shadow mode."""
    if SHADOW_MODE:
        logging.info(
            "shadow_prediction user_id=%s prediction=%.4f version=%s",
            user_id, prediction, model_version,
        )
        return  # do not write to production DB

    db.execute(
        "INSERT INTO predictions (user_id, prediction, model_version, ts) VALUES (?, ?, ?, NOW())",
        (user_id, prediction, model_version),
    )

Offline Label Joining for Shadow Evaluation

Shadow predictions must be joined with delayed ground truth labels to compute accuracy metrics. Ground truth often arrives hours after the prediction (e.g., a purchase event after a recommendation, a trip completion after an ETA prediction). A label joiner runs as a batch job:

python
import pandas as pd
from datetime import datetime, timedelta


def join_shadow_predictions_with_labels(
    shadow_logs: pd.DataFrame,       # columns: request_id, user_id, prediction, ts
    ground_truth: pd.DataFrame,      # columns: user_id, label, event_ts
    max_label_delay_hours: int = 24,
) -> pd.DataFrame:
    """
    For each shadow prediction, find the ground truth label that arrived
    within max_label_delay_hours after the prediction timestamp.
    Uses as-of join to avoid label leakage.
    """
    shadow_logs = shadow_logs.sort_values("ts")
    ground_truth = ground_truth.sort_values("event_ts")

    # Merge on user_id, take the first label event after the prediction
    joined = pd.merge_asof(
        shadow_logs,
        ground_truth,
        left_on="ts",
        right_on="event_ts",
        by="user_id",
        direction="forward",         # label must arrive AFTER prediction
        tolerance=pd.Timedelta(hours=max_label_delay_hours),
    )

    # Drop rows where no label arrived within the window
    joined = joined.dropna(subset=["label"])
    return joined


def compute_shadow_metrics(joined: pd.DataFrame, n_total: int) -> dict[str, float]:
    """n_total is the number of shadow predictions before the label join."""
    from sklearn.metrics import roc_auc_score, average_precision_score
    return {
        "auc": roc_auc_score(joined["label"], joined["prediction"]),
        "avg_precision": average_precision_score(joined["label"], joined["prediction"]),
        "n_predictions": len(joined),
        "label_join_rate": len(joined) / n_total,  # fraction that got a label
    }
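
With metrics in hand, a simple gate decides whether the candidate may proceed from shadow to canary. It consumes the output of compute_shadow_metrics for both the candidate and the current production model; the thresholds below are illustrative:

python
def shadow_gate(
    champion_metrics: dict[str, float],
    challenger_metrics: dict[str, float],
    max_auc_loss: float = 0.005,    # tolerate at most 0.5 points of AUC loss
    min_join_rate: float = 0.5,     # require enough labeled traffic
    min_predictions: int = 10_000,  # require enough volume for stable metrics
) -> bool:
    """Return True if the shadow candidate may proceed to canary."""
    if challenger_metrics["n_predictions"] < min_predictions:
        return False
    if challenger_metrics["label_join_rate"] < min_join_rate:
        return False
    return champion_metrics["auc"] - challenger_metrics["auc"] <= max_auc_loss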

Canary Rollout with KServe

KServe's canaryTrafficPercent field splits traffic between the last stable version and the current spec. Setting the field to 10 sends 10% of requests to the new model and 90% to the last version that received 100% traffic. Promotion removes the field; rollback sets it to 0.

yaml
# kserve-canary.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-classifier"
  namespace: ml-production
  annotations:
    serving.kserve.io/enable-tag-routing: "true"
spec:
  predictor:
    canaryTrafficPercent: 10      # 10% canary, 90% to last stable
    minReplicas: 2
    maxReplicas: 8
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://ml-models/fraud/v2.1.0"
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
bash
# Promote: 100% to new version
kubectl patch isvc fraud-classifier -n ml-production \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'

# Rollback: drain new version immediately
kubectl patch isvc fraud-classifier -n ml-production \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":0}]'

# Tag-based testing: call canary directly without touching the traffic split
curl -H "Host: latest-fraud-classifier-predictor-default.ml-production.example.com" \
  -H "Content-Type: application/json" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/fraud-classifier:predict \
  -d @test-payload.json

Automated Rollback with Argo Rollouts

Argo Rollouts runs AnalysisTemplate resources at each canary step. When an analysis metric fails failureLimit times, Argo Rollouts initiates an automatic rollback to the previous stable version.

yaml
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ml-model-quality-gate
  namespace: ml-production
spec:
  args:
  - name: service-name
  metrics:
  - name: prediction-success-rate
    interval: 60s
    successCondition: result[0] >= 0.95
    failureLimit: 3                # rollback after the metric fails 3 measurements
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(model_predictions_total{
            service="{{args.service-name}}",
            result="correct"
          }[5m])) /
          sum(rate(model_predictions_total{
            service="{{args.service-name}}"
          }[5m]))
  - name: p99-latency-ms
    interval: 60s
    failureCondition: result[0] > 200  # rollback if p99 latency exceeds 200ms
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(model_inference_duration_seconds_bucket{
              service="{{args.service-name}}"
            }[5m])) by (le)
          ) * 1000
---
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: churn-predictor
  namespace: ml-production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: churn-predictor
  template:
    metadata:
      labels:
        app: churn-predictor
    spec:
      containers:
      - name: model-server
        image: ml-registry/churn-predictor:v3.2.0
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}       # bake time at 10%
      - analysis:
          templates:
          - templateName: ml-model-quality-gate
          args:
          - name: service-name
            value: churn-predictor
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100

MLflow Champion/Challenger Aliases

As of MLflow 2.9.0, model stages (Staging/Production) are deprecated in favor of mutable aliases. Aliases support arbitrary names — champion, challenger, canary, shadow — and multiple aliases can point to the same version.

python
import mlflow
from mlflow import MlflowClient

MODEL_NAME = "fraud-classifier"
client = MlflowClient()

def register_challenger(run_id: str, metrics: dict) -> int:
    """Register a new model version as challenger."""
    model_uri = f"runs:/{run_id}/model"
    mv = mlflow.register_model(model_uri, MODEL_NAME)
    version = int(mv.version)

    # Tag with evaluation metrics for audit trail
    client.set_model_version_tag(MODEL_NAME, str(version), "auc", str(metrics["auc"]))
    client.set_model_version_tag(MODEL_NAME, str(version), "eval_date", metrics["date"])

    # Assign challenger alias — routing code loads by alias
    client.set_registered_model_alias(MODEL_NAME, "challenger", version)
    return version


def promote_challenger_to_champion() -> None:
    """Swap challenger → champion, retaining a rollback alias on the old champion."""
    challenger_mv = client.get_model_version_by_alias(MODEL_NAME, "challenger")
    champion_mv = client.get_model_version_by_alias(MODEL_NAME, "champion")

    # Promote
    client.set_registered_model_alias(MODEL_NAME, "champion", int(challenger_mv.version))
    # Retain rollback alias on old champion
    client.set_registered_model_alias(MODEL_NAME, "rollback", int(champion_mv.version))
    client.delete_registered_model_alias(MODEL_NAME, "challenger")


# Load by alias in serving code — alias resolves to the correct version
champion = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
challenger = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@challenger")
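
If the canary split lives in application code rather than the mesh, the alias-loaded models pair naturally with deterministic user bucketing, so each user consistently hits the same model across requests. A sketch:

python
import hashlib

CANARY_PERCENT = 10  # keep aligned with the traffic split configured in serving


def model_for_user(user_id: str):
    """Deterministically bucket users so each one sticks to one model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return challenger if bucket < CANARY_PERCENT else champion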

Common Mistakes

Writing to production state from the shadow path. If the shadow service sends emails, updates a recommendation cache, or records an impression, users receive duplicate effects. Every write path in the shadow deployment must be disabled or mocked. The SHADOW_MODE guard at the service level is not sufficient if downstream services are shared — a separate deployment with isolated dependencies is safer.

Running shadow without latency budget validation. The purpose of shadow is not only to compare predictions but to confirm the new model version stays within the production latency budget under realistic concurrency. Run shadow at a traffic percentage that stresses the shadow pod pool — 100% shadow is acceptable if the pods are isolated — before proceeding to canary.
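
A quick offline check of that budget, assuming each mirrored request logs its end-to-end latency in milliseconds:

python
import numpy as np


def shadow_latency_ok(
    shadow_ms: list[float],
    production_ms: list[float],
    budget_ms: float = 200.0,
    max_regression: float = 1.10,  # allow at most +10% vs production p99
) -> bool:
    """Shadow passes if its p99 fits the absolute budget and does not
    regress more than 10% relative to production's p99."""
    shadow_p99 = float(np.percentile(shadow_ms, 99))
    prod_p99 = float(np.percentile(production_ms, 99))
    return shadow_p99 <= budget_ms and shadow_p99 <= prod_p99 * max_regression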

Setting canary windows too short. Diurnal traffic patterns mean that a two-hour canary window may capture only daytime traffic. Prediction quality for a recommendation model may degrade specifically for overnight or weekend sessions. Canary windows SHOULD be at least 24 hours for consumer-facing services.

Using the same rollback thresholds for all models. A fraud model with 0.001% false positive rate needs tighter rollback thresholds than a recommendation model measuring click-through. Define thresholds per-model class based on the business cost of the error type.
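
One way to encode per-class thresholds; the numbers are illustrative:

python
# Illustrative rollback thresholds per model class; tune each to the
# business cost of that model's error type.
ROLLBACK_THRESHOLDS = {
    "fraud": {
        "max_false_positive_rate": 0.0001,  # blocking a good user is costly
        "max_p99_latency_ms": 100,
        "min_canary_hours": 48,
    },
    "recommendation": {
        "max_ctr_drop_pct": 2.0,  # relative CTR drop vs control
        "max_p99_latency_ms": 250,
        "min_canary_hours": 24,
    },
}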

Forgetting to drain connections before rollback. An abrupt rollback mid-stream leaves in-flight requests with no response. Use a pre-stop lifecycle hook in Kubernetes to drain existing connections before the pod terminates:

yaml
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 10"]   # wait for in-flight requests
