A Week in the Life of a Production ML Pipeline
Monday morning a drift alert fires. By Friday afternoon a new model is in production. Here is what happens in between, step by step, at a small company with a small team.
A typical week operating a production recommendation pipeline at a Series B startup in 2023 involves a predictable sequence: Monday drift alert from Evidently, Tuesday root-cause investigation in the feature store, Wednesday retrain on Kubeflow, Thursday shadow deploy behind a flag, Friday 10% canary. This article walks through each day with the tools, the metrics, and the decisions — including the ones that go wrong.
At the time of writing in 2023, I am on the MLOps team at a Series B consumer-marketplace startup. We have three ML engineers, one data engineer, and one infrastructure generalist. Between us we own four production models — two recommenders, a fraud classifier, and a ranking model for search. The recommenders get most of the operational attention because they drive the home-feed that keeps the lights on.
What follows is a composite week, sanitised of specifics. The pipeline we are operating is a matrix-factorisation-plus-gradient-boosted-trees recommender that scores roughly 11 million candidate items per day and serves a personalised feed to about 280,000 daily active users. The model is retrained weekly on a rolling 90-day window. The infrastructure is a mix of Airflow for batch orchestration, Kubeflow Pipelines for training, MLflow for experiment tracking and the model registry, Evidently for drift monitoring, and BentoML wrapped in a simple FastAPI service for serving.
Monday: the alert
The week starts at 08:47 Pacific with a Slack ping from #ml-alerts. Evidently has flagged a drift score of 0.31 on the user_last_7d_categories feature, well above our 0.15 alert threshold. This is the Population Stability Index variant Evidently uses for categorical features. A value above 0.25 is what the textbook calls "significant shift".
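For intuition, the PSI Evidently reports for a categorical feature can be sketched in a few lines. This is a minimal version for illustration, not Evidently's actual implementation, which handles binning, smoothing, and missing categories more carefully:

```python
import math
from collections import Counter

def categorical_psi(reference, current, eps=1e-6):
    """Population Stability Index between two categorical samples.

    eps guards against log(0) when a category is absent from one sample.
    """
    categories = set(reference) | set(current)
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    psi = 0.0
    for cat in categories:
        ref_p = max(ref_counts[cat] / len(reference), eps)
        cur_p = max(cur_counts[cat] / len(current), eps)
        psi += (cur_p - ref_p) * math.log(cur_p / ref_p)
    return psi

# Identical distributions give PSI near zero; a real shift does not.
baseline = ["home"] * 50 + ["garden"] * 50
shifted = ["home"] * 80 + ["garden"] * 20
```

The 0.15 alert threshold and the 0.25 "significant shift" rule of thumb are read straight off this number.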
First question, always: is this a real drift or a data-pipeline bug. We check the upstream Airflow DAG that populates the feature. It completed successfully at 04:15 and the row counts look normal. We check the raw event volume in our data warehouse — also normal. So the shape of the feature distribution has genuinely moved.
Second question: does it matter. A drift signal is not automatically a performance problem. We pull the last 14 days of online NDCG@10 from our offline evaluation job, which reprocesses the previous day's impressions against held-out labels. NDCG is steady at 0.412. Click-through rate on the home feed, from our product analytics, is down 0.4 percentage points week-over-week, which is within the normal noise band. No customer-facing regression yet.
Third question: why. A quick join between the drifted feature and our event log shows that a marketing campaign launched over the weekend has pushed an atypical user segment into the "home-and-garden" category. The feature is not broken; the population has shifted. We log the incident in our runbook, note that the weekly retrain on Wednesday should absorb the new distribution, and move on.
Tuesday: root cause and candidate experiments
Tuesday is for the less urgent tasks that accumulated during on-call. The drift from Monday is still on the board. Even if the weekly retrain should absorb it, we want to understand whether the model architecture is sensitive to this kind of categorical shift in the first place.
We spin up a Jupyter notebook attached to our feature store (Feast, at the time). We pull the last 30 days of training samples, bucket by the user_last_7d_categories feature, and run the current production model against each bucket separately. Two buckets show a meaningful hit-rate drop: "home-and-garden" (down 7.2%) and "seasonal" (down 4.1%). The rest are stable. This is the kind of result where an experienced ML engineer will say "the model has a recency problem in sparse categorical buckets" and they will usually be right.
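The bucket analysis itself is nothing fancy: group scored impressions by the feature value and compare hit rates per bucket, keeping the sample counts so sparse buckets are visible. A sketch with illustrative field names, not our actual schema:

```python
from collections import defaultdict

def hit_rate_by_bucket(samples):
    """samples: iterable of (bucket, hit) pairs, where hit is 0 or 1.

    Returns {bucket: (hit_rate, n)}; n matters because a scary-looking
    drop in a 40-sample bucket is usually just noise.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [hits, n]
    for bucket, hit in samples:
        totals[bucket][0] += hit
        totals[bucket][1] += 1
    return {b: (hits / n, n) for b, (hits, n) in totals.items()}

scored = [("home-and-garden", 0), ("home-and-garden", 1),
          ("seasonal", 0), ("electronics", 1), ("electronics", 1)]
```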
We open two MLflow experiments. Experiment A reduces the sliding-window from 90 days to 60. Experiment B keeps the 90-day window but adds an L2 regularisation term on the category embeddings to reduce overfitting on high-volume buckets. Both kick off on our Kubeflow cluster against the same held-out test set. Each training run takes about 3 hours and 40 minutes on a single g5.12xlarge node with four A10G GPUs. We go to lunch.
Wednesday: the retrain
Wednesday is retrain day. The regular cron job kicks off at 06:00 and produces a fresh model candidate by about 10:30. The Kubeflow pipeline has eight stages:
- Data snapshot — freeze the last 90 days of interaction data into a Parquet table in S3, versioned by run ID.
- Feature materialisation — Feast materialise-job writes the online feature view from the offline store.
- Train/validation split — 85/15 time-based split, not random, because this is a recommender.
- Training — the actual gradient-boosted layer on top of frozen matrix-factorisation embeddings, about 1.1 billion parameters on the MF side.
- Offline eval — NDCG@10, MAP@20, catalogue coverage, and a long-tail lift metric we care about.
- Model registration — artefact goes to MLflow's model registry, tagged staging.
- Gate — automated check against last week's model on the same eval set. We require NDCG@10 within 0.5% and catalogue coverage within 2%.
- Notify — Slack ping to #ml-releases.
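The time-based split in stage 3 deserves a sketch, because it is the step people most often get wrong: a random split lets the model train on interactions that happen after the ones it is evaluated on. A minimal version, assuming each row carries a timestamp:

```python
def time_based_split(rows, val_fraction=0.15, ts_key=lambda r: r["ts"]):
    """Split interaction rows so validation is strictly later than training.

    A random split would leak the future into training -- classic
    leakage for a recommender. Field name 'ts' is illustrative.
    """
    ordered = sorted(rows, key=ts_key)
    cut = int(len(ordered) * (1 - val_fraction))
    return ordered[:cut], ordered[cut:]

events = [{"ts": t, "user": t % 7} for t in range(100)]
train, val = time_based_split(events)
```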
Today the gate fails. The new model's NDCG@10 is 0.408 versus the incumbent's 0.412 — a 1.0% drop, outside the tolerance. The cause turns out to be the experiment A change from Tuesday that one of us merged prematurely to the training config. We revert, rerun, and by 15:45 we have a candidate that passes: NDCG@10 of 0.413, catalogue coverage up 1.1%. It goes to the registry tagged staging.
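The gate's core comparison is a few lines. A sketch of the check (the real pipelines.gate does more bookkeeping; the metric names here are illustrative):

```python
def passes_gate(candidate, incumbent, ndcg_tol=0.005, coverage_tol=0.02):
    """Relative-tolerance comparison of candidate vs incumbent metrics.

    Metrics are dicts like {"ndcg_at_10": 0.413, "coverage": 0.31}.
    A drop larger than the tolerance on either metric fails the gate.
    """
    ndcg_drop = (incumbent["ndcg_at_10"] - candidate["ndcg_at_10"]) / incumbent["ndcg_at_10"]
    coverage_drop = (incumbent["coverage"] - candidate["coverage"]) / incumbent["coverage"]
    return ndcg_drop <= ndcg_tol and coverage_drop <= coverage_tol

# Wednesday's first candidate: a ~1.0% NDCG drop, outside the 0.5% tolerance.
first_attempt = passes_gate({"ndcg_at_10": 0.408, "coverage": 0.31},
                            {"ndcg_at_10": 0.412, "coverage": 0.31})
```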
Here is the Airflow DAG fragment that schedules all of this. It is straightforward and embarrassingly short for something that is supposed to be a central nervous system:
```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "ml-platform",
    "retries": 1,
    "retry_delay": timedelta(minutes=15),
}

with DAG(
    dag_id="home_feed_recommender_retrain",
    default_args=default_args,
    schedule_interval="0 6 * * 3",  # Wednesday 06:00 UTC
    start_date=datetime(2023, 1, 4),
    catchup=False,
    max_active_runs=1,
    tags=["ml", "recommender", "weekly"],
) as dag:
    retrain = KubernetesPodOperator(
        task_id="kfp_retrain",
        name="home-feed-retrain",
        image="registry.internal/ml/home-feed:v2.14.0",
        cmds=["python", "-m", "pipelines.retrain"],
        arguments=["--window-days", "90", "--mlflow-experiment", "home-feed-weekly"],
        get_logs=True,
        is_delete_operator_pod=True,
        in_cluster=True,
    )
    gate = KubernetesPodOperator(
        task_id="eval_gate",
        name="eval-gate",
        image="registry.internal/ml/home-feed:v2.14.0",
        cmds=["python", "-m", "pipelines.gate"],
        arguments=["--ndcg-tolerance", "0.005", "--coverage-tolerance", "0.02"],
    )
    retrain >> gate
```

Thursday: shadow deploy
The staging model does not go straight to production. It goes into a shadow mode: the BentoML service receives each incoming request, scores it with both the production model and the staging model, returns the production result to the user, and logs the staging result to a Kafka topic for offline comparison. Shadow deploys are cheap insurance. They catch the problems that never show up in offline evaluation — feature-pipeline skew, encoding mismatches, the moment you realise the online embeddings are from a different training run than you thought.
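Stripped of the BentoML and Kafka plumbing, the shadow path is one function: score with both models, return only the production result, log the pair. The sink and field names below are invented for illustration; in the real service the log sink is a Kafka producer keyed on request ID:

```python
import json

def score_with_shadow(request, prod_model, staging_model, log_sink):
    """Return the production score; record both scores for offline comparison.

    log_sink is anything append-like -- in production, a Kafka producer.
    The user only ever sees the production result.
    """
    prod_score = prod_model(request)
    staging_score = staging_model(request)
    log_sink.append(json.dumps({
        "request_id": request["id"],
        "prod": prod_score,
        "staging": staging_score,
    }))
    return prod_score

shadow_log = []
result = score_with_shadow({"id": "r1"},
                           prod_model=lambda r: 0.7,
                           staging_model=lambda r: 0.72,
                           log_sink=shadow_log)
```

The one operational caveat: scoring twice doubles inference cost on the shadow path, which is why we only run it for a bounded window rather than permanently.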
We run the shadow deploy for 18 hours, collect about 4.2 million paired predictions, and compute the KL divergence between the score distributions. The divergence is 0.08, which is within what we have historically tolerated for a weekly retrain (typical range 0.05 to 0.12). Nothing catastrophic. The staging model also shows a 0.6% improvement in a back-test against Thursday's actual clicks, using the same labels-from-future methodology as the offline eval.
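The divergence is computed over histograms of the paired scores. A minimal sketch, with the bin count a judgment call and a small epsilon so empty bins do not blow up the log:

```python
import math

def kl_divergence(p_scores, q_scores, bins=20, eps=1e-9):
    """KL(P || Q) between two score samples, via shared histogram bins."""
    lo = min(min(p_scores), min(q_scores))
    hi = max(max(p_scores), max(q_scores))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def hist(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int((s - lo) / width), bins - 1)
            counts[idx] += 1
        # Smoothed, normalised probabilities so no bin is exactly zero.
        return [(c + eps) / (len(scores) + bins * eps) for c in counts]

    p, q = hist(p_scores), hist(q_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical score distributions give a divergence near zero; our 0.05 to 0.12 band is an empirical tolerance for how much a weekly retrain normally moves the scores.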
Friday: canary and rollout
Friday morning we promote the staging model to a 10% canary. The traffic-splitting happens at the feature-flag layer (LaunchDarkly), not at the model-server layer, so we can key it on user ID and keep each user on a consistent model for the duration of their session. 10% for two hours. The online metrics panel in Grafana shows CTR for the canary cohort at 11.8% versus 11.4% for the control, a 3.5% relative lift, well inside statistical noise but at least not negative. P99 latency is 47ms for canary, 44ms for control. That extra 3ms is within budget.
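The consistent per-user assignment is the whole point of splitting at the flag layer: each user hashes into a stable bucket, so nobody flips between models mid-session. A sketch of the idea — this is not LaunchDarkly's actual algorithm, and the salt name is invented:

```python
import hashlib

def canary_bucket(user_id, canary_percent=10, salt="home-feed-rollout-1"):
    """Deterministically assign a user to the canary or control cohort.

    The salt is rotated per rollout so the same 10% of users are not
    always the guinea pigs. Same user + same salt -> same bucket, always.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "canary" if bucket < canary_percent else "control"
```

Ramping to 50% and then 100% is just raising `canary_percent`; users already in the canary stay there, which keeps the cohorts clean.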
We ramp to 50% at 13:00 and 100% at 15:30. The old model artefact stays in the registry tagged archive for 14 days in case we need a fast rollback. The MLflow run ID is pinned in a runbook entry so anyone can find it at 3 a.m. on a Saturday.
Day        | Phase                  | Tool              | Key metric
-----------+------------------------+-------------------+-------------------------
Monday     | Drift alert            | Evidently 0.4     | PSI 0.31 on cat feature
Tuesday    | Root-cause + exp prep  | MLflow, Feast     | hit-rate bucket analysis
Wednesday  | Scheduled retrain      | Kubeflow, Airflow | NDCG@10 = 0.413
Thursday   | Shadow deploy          | BentoML, Kafka    | KL divergence 0.08
Friday AM  | 10% canary             | LaunchDarkly      | CTR +3.5% rel (noisy)
Friday PM  | 50% -> 100% rollout    | LaunchDarkly      | p99 latency 47ms
What goes wrong (and will)
This article makes the process sound orderly. It is orderly most weeks. But one in four weeks something breaks in a way the runbook does not cover. Here are the three recurring categories, in the order they happen to us:
- Training-serving skew. The offline pipeline computes a feature one way; the online feature store stores a subtly different version. You only notice when the shadow KL divergence is suddenly 0.4 and the model "looks broken" in production even though offline eval is clean. The fix is always the same: make the offline and online features come from the same code path. It is always harder than it sounds.
- Silent label leakage. Someone adds a feature that includes information from the future — a click that happened after the impression we are trying to predict. Offline NDCG goes up, online CTR goes flat or down. The cure is strict temporal splits and a paranoid reviewer on every feature PR.
- Retrain loops. The model learns to prefer items that its previous version surfaced more often, because those are the items that got the clicks. The feed becomes narrower every week. We mitigate with a catalogue-coverage gate, an explore-exploit layer, and an annual "fairness" audit where we re-score against a held-out uniform sample. It is not a solved problem.
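For the label-leakage case, the cheapest mechanical defence is an assertion in the feature pipeline that no feature value was computed after the impression it describes. A sketch, with illustrative field names:

```python
def assert_no_future_features(samples):
    """Fail fast if any feature was computed after its impression.

    samples: iterable of dicts with 'impression_ts' and 'feature_ts'
    (field names are illustrative). Raises on the first violation so
    the leaky feature never reaches training.
    """
    for i, s in enumerate(samples):
        if s["feature_ts"] > s["impression_ts"]:
            raise ValueError(
                f"sample {i}: feature computed at {s['feature_ts']} "
                f"but impression happened at {s['impression_ts']}"
            )

clean = [{"impression_ts": 100, "feature_ts": 95}]
leaky = [{"impression_ts": 100, "feature_ts": 120}]
```

It will not catch leakage baked into an upstream aggregate, which is why the paranoid reviewer on every feature PR is still in the list.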
The operating principle
The thing the textbooks get right and the blog posts mostly do not is that the ML part of MLOps is the easy part. The model trains. It converges. It has reasonable validation loss. The operational part — the drift alert that fires when it should, the retrain that finishes in a predictable window, the rollback that works when you need it — is what you are actually maintaining. In the six months we have been running the current architecture, the model has changed three times and the pipeline around it has changed fourteen.
The weekly cadence is not there because the model needs to be retrained weekly. It is there because we have built a team that can handle one controlled change per week without getting burnt out. If we tried to ship every model improvement immediately, the on-call rotation would collapse in a month.
Production ML is a systems discipline. In 2023 the tools are finally mature enough — Kubeflow, MLflow, Evidently, Airflow, BentoML, Feast — that a five-person team can run a real pipeline without rebuilding the scaffolding from scratch. That is the quiet victory of the last three years. The loud one, the one everyone is talking about, will be obsolete by 2024.
Reviewed for technical accuracy by Dr. Nadia Volkov before publication. Corrections to editors@mlsystemsreview.com.