[BEE-30089] Testing Machine Learning Pipelines

INFO

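ML pipeline testing applies software engineering test discipline to the distinctive failure modes of data-dependent, probabilistic systems. Unlike traditional software, ML code can silently produce wrong output without raising an exception: a feature transformer that clips at the wrong percentile, a training loop with a detached tensor, or a serving function that applies feature scaling in a different order than training did. Tests catch these failures at development time rather than in production predictions.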

Background

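Breck et al., "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (Google, IEEE Big Data 2017), draws on real production ML systems and reports that most teams test only a small fraction of the failure modes they face. The rubric organizes 28 tests into four sections: features and data, model development, ML infrastructure, and monitoring. Each test earns half a point when it is run manually with documented results and a full point when it is automated. Points are summed per section, and the final score is the minimum of the four section scores, so one neglected section caps the whole result. The paper reads a score above 5 as exceptional automated testing, a score between 1 and 2 as only a first pass at productionization, and a score of 0 as closer to a research project than a production system.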

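The core difficulty is that ML code has three properties that make it hard to test: nondeterminism (random initialization, shuffled batches, GPU floating-point differences), data dependence (bugs are conditional on the data: the same code passes its unit tests but fails on a particular data distribution), and silent failure (training can finish without raising an exception even when model quality is zero). Conventional testing practice on its own addresses none of the three; the ML-specific practices described here target each of them.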
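To make the silent-failure property concrete, here is a minimal sketch of a training/serving mismatch and a parity check that exposes it. Every name and constant (`training_features`, `serving_features`, the clip bounds, the mean and standard deviation) is hypothetical and chosen only for illustration.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def standardize(x: float, mean: float, std: float) -> float:
    """Shift and scale x using fixed (hypothetical) training statistics."""
    return (x - mean) / std


def training_features(x: float) -> float:
    # Training path: clip the raw value first, then standardize it.
    return standardize(clip(x, 0.0, 100.0), mean=50.0, std=25.0)


def serving_features(x: float) -> float:
    # Buggy serving path: the same two steps in the opposite order.
    # Nothing raises; the output is simply different for some inputs.
    return clip(standardize(x, mean=50.0, std=25.0), 0.0, 100.0)


def parity_violations(inputs: list[float], tol: float = 1e-9) -> list[float]:
    """Return the inputs on which the two paths disagree by more than tol."""
    return [
        x for x in inputs
        if abs(training_features(x) - serving_features(x)) > tol
    ]
```

Inputs in the middle of the range (50.0, 75.0) agree on both paths, which is why a single happy-path example would not reveal the bug; inputs near the clip bounds (20.0, 150.0) disagree.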

Hermetic Test Setup

A hermetic test is isolated, reproducible, and deterministic. In ML, that requires explicitly seeding every source of randomness:

python

import random
import numpy as np
import torch
import os

SEED = 42

def make_hermetic() -> None:
    """Call at the top of any test that involves randomness."""
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    # Force deterministic CUDA operations: slower but reproducible
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    torch.use_deterministic_algorithms(True)

Apply it automatically as a pytest fixture:

python

import pytest

@pytest.fixture(autouse=True)
def hermetic_seed():
    make_hermetic()
    yield
    # No cleanup needed: the seeds are process-level state

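torch.use_deterministic_algorithms(True) makes PyTorch use deterministic algorithms where they exist and raise a RuntimeError for operations that have no deterministic implementation, at some performance cost. Use it in test environments; disable it in production. Note: even with the same seed, results can still differ between CPU and GPU execution, so always run comparison tests on the same device.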
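Global seeding can be complemented by passing an explicitly seeded generator to the code that shuffles, which avoids depending on process-wide state at all. The sketch below uses only the standard library as a stand-in for a data loader; `shuffled_batches` is a hypothetical helper, not part of any library.

```python
import random


def shuffled_batches(n_rows: int, batch_size: int, seed: int) -> list[list[int]]:
    """Split row indices 0..n_rows-1 into batches after a seeded shuffle."""
    rng = random.Random(seed)  # private generator: no global state is touched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_rows, batch_size)]


def test_same_seed_gives_same_batches():
    # Reproducibility: identical seed, identical batch order.
    assert shuffled_batches(100, 32, seed=42) == shuffled_batches(100, 32, seed=42)


def test_batches_cover_every_row_once():
    # Shuffling must not drop or duplicate rows.
    batches = shuffled_batches(100, 32, seed=42)
    flat = [i for batch in batches for i in batch]
    assert sorted(flat) == list(range(100))
    assert [len(b) for b in batches] == [32, 32, 32, 4]
```

The second test is worth keeping next to the first: a shuffle that is reproducible but loses rows would still pass a pure determinism check.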

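Feature Transformation Tests

Feature engineering code MUST be covered by property-based tests, not only example-based tests. Property-based testing (the Hypothesis library) generates many randomized inputs per test and searches for the edge cases that break an invariant, typically out-of-range values, empty inputs, or NaN, which example-based tests miss.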

python

import pytest
import numpy as np
import pandas as pd
from hypothesis import given, settings, HealthCheck
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

from mypackage.features import log_transform, clip_outliers

# Example-based: verify known behavior
def test_log_transform_known_input():
    result = log_transform(pd.Series([1.0, np.e, np.e**2]))
    expected = pd.Series([0.0, 1.0, 2.0])
    pd.testing.assert_series_equal(result, expected, atol=1e-9)

# Property-based: verify that the invariants hold for any valid input
@given(
    st.lists(
        st.floats(min_value=0.01, max_value=1e9, allow_nan=False, allow_infinity=False),
        min_size=1,
        max_size=1000,
    )
)
@settings(suppress_health_check=[HealthCheck.too_slow])
def test_log_transform_invariants(values):
    series = pd.Series(values, dtype=float)
    result = log_transform(series)

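    # Invariant: the output has the same length as the input
    assert len(result) == len(series)
    # Invariant: the log of a positive input is finite (no NaN and no inf)
    assert np.isfinite(result).all()
    # Invariant: monotonicity, a larger input never gives a smaller output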
    if len(series) >= 2:
        paired = pd.DataFrame({"x": series, "y": result}).sort_values("x")
        assert (paired["y"].diff().dropna() >= 0).all()

# sklearn transformer contract compliance
from sklearn.utils.estimator_checks import check_estimator
from mypackage.features import ClipOutliersTransformer

def test_clip_outliers_transformer_sklearn_contract():
    """Verify the custom transformer satisfies the full sklearn estimator API."""
    check_estimator(ClipOutliersTransformer())

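check_estimator runs scikit-learn's own estimator test suite: dozens of checks covering the fit/transform contract, clone behavior, serialization, and edge cases. Any custom sklearn transformer MUST pass it before being deployed in a Pipeline.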
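The edge cases named at the start of this section (empty input, NaN, out-of-range values) can also be pinned down as explicit invariants. The sketch below is a standard-library illustration only: `clip_outliers` here is a hypothetical list-based stand-in, not the `mypackage.features.clip_outliers` imported earlier, and in a real suite Hypothesis would generate the inputs.

```python
import math


def clip_outliers(values: list[float], lower: float, upper: float) -> list[float]:
    """Clamp each value into [lower, upper]; NaN passes through unchanged."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [v if math.isnan(v) else min(max(v, lower), upper) for v in values]


def same_value(a: float, b: float) -> bool:
    """Equality that treats NaN as equal to NaN."""
    return (math.isnan(a) and math.isnan(b)) or a == b


def check_clip_invariants(values: list[float], lower: float, upper: float) -> bool:
    clipped = clip_outliers(values, lower, upper)
    # Invariant: length is preserved, including for empty input.
    assert len(clipped) == len(values)
    for before, after in zip(values, clipped):
        if math.isnan(before):
            # Invariant: missing values stay missing rather than becoming a bound.
            assert math.isnan(after)
        else:
            # Invariant: every non-missing output lies inside the bounds.
            assert lower <= after <= upper
    # Invariant: idempotence, clipping twice equals clipping once.
    again = clip_outliers(clipped, lower, upper)
    assert all(same_value(a, b) for a, b in zip(clipped, again))
    return True
```

Idempotence is a useful property to assert for any clipping or capping transform, because it fails when the bounds are recomputed from the already-clipped data.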

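Training Pipeline Smoke Tests

A smoke test runs the full training pipeline on a minimal dataset (1,000 rows, 1 epoch) to verify that the code path completes without error. It catches data loading errors, incompatible tensor shapes, missing features, and misconfigured loss functions.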

python

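import numpy as np
import torch
import torch.nn as nn
# build_classifier is used by test_initial_loss_sanity; importing it from
# mypackage.train is an assumption, adjust to wherever it lives in your package.
from mypackage.train import build_model, build_classifier, build_dataloader, train_one_epoch
# make_hermetic() is the seeding helper defined earlier (for example in conftest.py).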

def test_training_pipeline_smoke(tmp_path):
    """The full training pipeline completes on 1,000 rows for 1 epoch."""
    make_hermetic()

    # CI runs on CPU; GPU tests are reserved for the integration stage
    device = torch.device("cpu")
    dataloader = build_dataloader(
        data_path="tests/fixtures/sample_1000.parquet",
        batch_size=32,
        device=device,
    )
    model = build_model(input_dim=30, hidden_dim=64, output_dim=1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()

    loss = train_one_epoch(model, dataloader, optimizer, criterion, device)

    # The loss must be finite and positive
    assert torch.isfinite(torch.tensor(loss)), f"Loss is not finite: {loss}"
    assert loss > 0, f"Loss is not positive: {loss}"

def test_initial_loss_sanity():
    """For N-class cross-entropy at random initialization, loss ≈ ln(N)."""
    make_hermetic()
    N_CLASSES = 10
    model = build_classifier(input_dim=30, n_classes=N_CLASSES)
    X = torch.randn(256, 30)
    y = torch.randint(0, N_CLASSES, (256,))

    logits = model(X)
    loss = nn.CrossEntropyLoss()(logits, y).item()

    expected = np.log(N_CLASSES)  # ≈ 2.303
    assert abs(loss - expected) < 0.5, (
        f"Initial loss {loss:.3f} is too far from ln({N_CLASSES})={expected:.3f}. "
        f"Check the label encoding or the loss function setup."
    )

def test_gradients_flow_to_all_parameters():
    """After one backward pass, every trainable parameter should receive a gradient."""
    make_hermetic()
    model = build_model(input_dim=30, hidden_dim=64, output_dim=1)
    X = torch.randn(32, 30)
    y = torch.randn(32, 1)

    output = model(X)
    loss = nn.MSELoss()(output, y)
    loss.backward()

    for name, param in model.named_parameters():
        if param.requires_grad:
            assert param.grad is not None, f"Parameter {name} has no gradient"
            assert not torch.all(param.grad == 0), f"Parameter {name} has an all-zero gradient"

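The initial-loss sanity check pays off especially well: a model with a one-hot encoding bug or the wrong loss function produces an initial loss far from ln(N), so the test fails before any training starts.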
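The ln(N) reference value follows from the definition of cross-entropy: an untrained classifier that spreads probability evenly assigns 1/N to the true class, and -ln(1/N) = ln(N). The sketch below checks this with a plain-Python softmax cross-entropy; the function name and inputs are illustrative assumptions and do not reproduce PyTorch's implementation.

```python
import math


def cross_entropy(logits: list[float], target: int) -> float:
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # -log(softmax(logits)[target]) == log_sum_exp - logits[target]
    return log_sum_exp - logits[target]


def uniform_initial_loss(n_classes: int) -> float:
    """Loss of a classifier whose logits are all equal, for any target class."""
    return cross_entropy([0.0] * n_classes, target=0)
```

For 10 classes the uniform loss equals `math.log(10)`, and for 2 classes it equals `math.log(2)`. A freshly initialized network is only approximately uniform, which is why the test above allows a tolerance of 0.5 rather than demanding equality.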

Behavioral Tests

Behavioral testing (Ribeiro et al., "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList", ACL 2020 Best Paper) organizes tests by the type of behavior being checked rather than by code unit:

python

# MFT (Minimum Functionality Test): the model handles typical inputs correctly
def test_mft_high_risk_user_predicted_positive():
    """A user with every high-risk feature MUST receive a positive fraud prediction."""
    high_risk_features = {
        "transaction_amount": 9999.0,
        "is_new_card": 1,
        "country_mismatch": 1,
        "time_since_last_txn_minutes": 2,
    }
    prediction = model.predict_proba([high_risk_features])[0, 1]
    assert prediction > 0.8, f"High-risk user should score >0.8, got {prediction:.3f}"

# INV (Invariance test): the prediction MUST NOT change when an irrelevant feature changes
def test_inv_user_name_does_not_affect_prediction():
    """Changing the user name must not affect the fraud prediction (the name is not a model feature)."""
    base = {"transaction_amount": 500.0, "is_new_card": 0, ...}
    perturbed = {**base, "user_name": "different_name"}
    assert model.predict_proba([base])[0, 1] == model.predict_proba([perturbed])[0, 1]

# DIR (Directional expectation test): the prediction SHOULD move in the expected direction
def test_dir_higher_amount_increases_fraud_score():
    """Doubling the transaction amount SHOULD increase the fraud probability."""
    base = {"transaction_amount": 200.0, "is_new_card": 0, "country_mismatch": 0, ...}
    high_amount = {**base, "transaction_amount": 400.0}

    score_base = model.predict_proba([base])[0, 1]
    score_high = model.predict_proba([high_amount])[0, 1]
    assert score_high > score_base, (
        f"A higher amount should increase the fraud score: "
        f"base={score_base:.3f}, doubled={score_high:.3f}"
    )
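The three test types can be exercised end to end against a toy scorer, which is a convenient way to validate the test harness before pointing it at a real model. Everything in this sketch is a hypothetical stand-in: `fraud_score`, its weights, and its bias are invented for illustration and bear no relation to a trained model.

```python
import math

# Hypothetical hand-set logistic scorer used only to exercise the test shapes.
WEIGHTS = {"transaction_amount": 0.002, "is_new_card": 1.5, "country_mismatch": 2.0}
BIAS = -4.0


def fraud_score(row: dict) -> float:
    """Return a probability in (0, 1); keys outside WEIGHTS are ignored."""
    z = BIAS + sum(weight * row[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def test_mft_high_risk_row_scores_high():
    row = {"transaction_amount": 9999.0, "is_new_card": 1, "country_mismatch": 1}
    assert fraud_score(row) > 0.8


def test_inv_user_name_is_ignored():
    base = {"transaction_amount": 500.0, "is_new_card": 0, "country_mismatch": 0}
    perturbed = {**base, "user_name": "different_name"}
    # Exact equality is safe here because the extra key never enters the sum.
    assert fraud_score(base) == fraud_score(perturbed)


def test_dir_doubling_amount_raises_score():
    base = {"transaction_amount": 200.0, "is_new_card": 0, "country_mismatch": 0}
    doubled = {**base, "transaction_amount": 400.0}
    assert fraud_score(doubled) > fraud_score(base)
```

With a real model, the invariance test may need a small tolerance instead of exact equality if the perturbed feature passes through any numeric preprocessing.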

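Model Regression Tests

Regression tests verify that a newly trained model has not silently degraded relative to a known baseline. Store the baseline predictions as a fixture and compare within a tolerance: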

python

import pytest
import numpy as np

BASELINE_PREDICTIONS_PATH = "tests/fixtures/baseline_predictions.npy"

def test_model_predictions_match_baseline():
    """Model outputs MUST match the baseline within a 1% relative tolerance."""
    X_test = np.load("tests/fixtures/X_test_100.npy")
    baseline = np.load(BASELINE_PREDICTIONS_PATH)

    predictions = model.predict_proba(X_test)[:, 1]

    # pytest.approx supports array comparison with a tolerance
    assert predictions == pytest.approx(baseline, rel=0.01), (
        "Model predictions deviate from the baseline by more than 1%. "
        "Regenerate the baseline or investigate the model change."
    )

def test_model_calibration_within_tolerance():
    """Predicted probabilities SHOULD be well calibrated on average: mean(pred) ≈ mean(actual)."""
    X_test = np.load("tests/fixtures/X_test_10k.npy")
    y_test = np.load("tests/fixtures/y_test_10k.npy")

    proba = model.predict_proba(X_test)[:, 1]
    predicted_rate = proba.mean()
    actual_rate = y_test.mean()

    assert abs(predicted_rate - actual_rate) < 0.05, (
        f"Model is poorly calibrated: predicted={predicted_rate:.3f}, actual={actual_rate:.3f}"
    )
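A purely relative tolerance is strict for probabilities close to zero, where a negligible absolute change is a large relative one. The sketch below illustrates the effect with `math.isclose` as a standard-library stand-in; its relative tolerance is defined slightly differently from `pytest.approx` (which also accepts an `abs` argument), and `within_tolerance` is a hypothetical helper.

```python
import math


def within_tolerance(
    predictions: list[float],
    baseline: list[float],
    rel: float = 0.01,
    abs_tol: float = 0.0,
) -> bool:
    """True when every prediction is close to its baseline value.

    A pair passes if it meets either the relative or the absolute tolerance.
    """
    if len(predictions) != len(baseline):
        return False
    return all(
        math.isclose(p, b, rel_tol=rel, abs_tol=abs_tol)
        for p, b in zip(predictions, baseline)
    )


baseline = [0.90, 0.001]
retrained = [0.905, 0.0011]  # second score moved by only 0.0001, yet by 10% relatively

strict = within_tolerance(retrained, baseline, rel=0.01)
relaxed = within_tolerance(retrained, baseline, rel=0.01, abs_tol=1e-3)
```

Here `strict` is False and `relaxed` is True. Whether an absolute floor is appropriate depends on how the scores are consumed: it is reasonable when decisions use a threshold far from zero, and less so when small probabilities are ranked against each other.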

CI Integration

ML pipeline tests MUST run on every pull request targeting the main branch. Use DVC to cache expensive data artifacts and run only the affected pipeline stages:

yaml

# .github/workflows/ml-tests.yml
name: ML Pipeline Tests

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install -e ".[test]"

      - name: Cache DVC artifacts
        uses: actions/cache@v4
        with:
          path: .dvc/cache
          key: dvc-${{ hashFiles('dvc.lock') }}

      - name: Pull test fixtures
        run: dvc pull tests/fixtures

      - name: Run unit tests (fast, CPU-only)
        run: pytest tests/unit -v --timeout=120

      - name: Run pipeline smoke test (1k rows, 1 epoch)
        run: pytest tests/integration/test_training_smoke.py -v --timeout=300
        env:
          CUBLAS_WORKSPACE_CONFIG: ":16:8"

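Set pipeline gates: unit tests MUST pass within 2 minutes. The smoke test MUST complete within 5 minutes on a 2-core runner. GPU integration tests MAY run only on merges to main rather than on every PR. Note that --timeout comes from the pytest-timeout plugin and limits each individual test, not the whole run, so a suite-level gate also needs a step-level or job-level limit such as timeout-minutes.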

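Common Mistakes

Testing only the happy path. Feature transformers fail on NaN values, empty DataFrames, single-element Series, and out-of-range inputs. Property-based testing with Hypothesis finds these inputs automatically. Example-based tests that use only clean data miss the bugs that show up on production data.

Not seeding every source of randomness. Setting only numpy.random.seed while forgetting torch.manual_seed or random.seed leaves randomness in the tests. The symptom is a flaky test that passes most of the time, which is expensive to diagnose. Use the make_hermetic() pattern as an autouse fixture.

Running behavioral tests against a stale baseline model. Behavioral tests (test_dir_*, test_inv_*) verify properties of the current model artifact, and baseline predictions change whenever the model is retrained. Pin the model artifact version in the test fixture and update it deliberately on every retrain.

Treating test timeouts as an infrastructure problem. A smoke test that takes 20 minutes does not need a faster CI runner; it needs a smaller test dataset. Test data fixtures MUST be small enough to run within 5 minutes on ordinary hardware. This enforces fast feedback and keeps CI from becoming a bottleneck.

Skipping the initial-loss sanity check. The ln(N) test catches label encoding bugs, wrong loss function arguments, and architecture errors before the first gradient step. Skipping it means these bugs are found after hours of GPU training instead of within seconds.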
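For the stale-artifact mistake above, one way to pin a model artifact is to record its checksum next to the tests and fail fast when the file on disk no longer matches. This is a minimal sketch under that assumption; `sha256_of`, `is_pinned`, and the file name `model.bin` are hypothetical, and a team using DVC or a model registry would more likely pin a version identifier from that tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_pinned(path: Path, expected_sha256: str) -> bool:
    """True when the artifact on disk matches the checksum recorded in the tests."""
    return sha256_of(path) == expected_sha256


def demo() -> tuple[bool, bool]:
    """Pin a fake artifact, then overwrite it to simulate a retrain."""
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "model.bin"
        artifact.write_bytes(b"weights-v1")
        pinned = sha256_of(artifact)
        before = is_pinned(artifact, pinned)
        artifact.write_bytes(b"weights-v2")  # retrained model replaces the file
        after = is_pinned(artifact, pinned)
    return before, after
```

Calling `demo()` returns `(True, False)`: the check passes for the pinned file and fails once the artifact changes, which forces the baseline and the pin to be updated together.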

References

Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. IEEE Big Data 2017. https://research.google/pubs/the-ml-test-score-a-rubric-for-ml-production-readiness-and-technical-debt-reduction/
Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. ACL 2020 Best Paper. arXiv:2005.04118. https://aclanthology.org/2020.acl-main.442/
Hypothesis documentation. https://hypothesis.readthedocs.io/
PyTorch, Reproducibility. https://docs.pytorch.org/docs/stable/notes/randomness.html
PyTorch, torch.use_deterministic_algorithms. https://docs.pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
scikit-learn, check_estimator. https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.check_estimator.html
DVC, Data Pipelines guide. https://doc.dvc.org/start/data-pipelines/data-pipelines
pytest documentation, pytest.approx. https://docs.pytest.org/en/stable/reference/reference.html
