The MLOps Stack of 2023: What's Worth Adopting
Ten tools, five categories, one opinionated recommendation per category. The honest version — including which of these are worth the operational cost and which you can skip if your team is small.
The MLOps tool landscape in late 2023 has consolidated around roughly ten serious choices spread across five categories: experiment tracking (MLflow, Weights & Biases), orchestration (Kubeflow, Metaflow), data versioning (DVC), monitoring (Evidently, WhyLabs), and serving (BentoML), plus the cross-cutting managed platforms (Amazon SageMaker, Google Vertex AI). Small teams should pick one per category and resist the urge to run the full stack. For most teams of fewer than ten engineers, the pragmatic default is MLflow + Metaflow + DVC + Evidently + BentoML, run on whatever cloud you already pay for.
It is December 2023, and every ML team we speak to asks us some variant of the same question: what tools should we use? The honest answer, that it depends on the size of the team and the shape of the workload, does not satisfy anyone. So in this article we are going to be more specific: we will name ten tools, map them against five categories, and give an opinionated recommendation at the end. These are our views at ML Systems Review based on actual operational experience with most of these tools, and on interviews with engineers at six companies running production ML pipelines.
The caveats, in order of importance: (1) none of these tools are bad; they are differently good. (2) Integration cost between tools is almost always higher than the sticker price of the tools themselves. (3) If you are a team of one or two, you do not need most of this.
The five categories
We divide the 2023 MLOps landscape into five operational concerns. A serious ML team needs a tool — not necessarily a separate tool — for each:
- Experiment tracking and model registry (MLflow, Weights & Biases). Where does the record of "what did we train, on what data, with what hyperparameters, and how did it do" live?
- Pipeline orchestration (Kubeflow, Metaflow, plus the managed platforms). How do training and batch-inference jobs get kicked off, scheduled, and retried?
- End-to-end managed platforms (Amazon SageMaker, Google Vertex AI). The cloud-vendor answer to "do you really want to run your own MLOps stack".
- Data and artefact versioning (DVC, LakeFS-adjacent tools). How do you reproduce a training run from three months ago when someone asks?
- Monitoring and drift detection (Evidently, WhyLabs, Arize, Fiddler). How do you know when the production model has stopped working?
- Model serving (BentoML, Seldon Core, Triton). How does a trained model artefact become an HTTP endpoint?
Yes, that is six categories. We merged "managed platforms" into the matrix because they cut across most of the others. On to the tools.
Experiment tracking: MLflow vs Weights & Biases
MLflow (version 2.9 at time of writing) is the open-source default. It has a tracking server, a model registry, a simple Python SDK, and a UI that was described to us by one engineer as "good enough that you forget it is free". In 2023, MLflow is included out of the box in Databricks, is a first-class option in every major cloud ML platform, and has a production story that works for teams in the one-to-fifty-engineer range without significant surgery. The tracking backend is Postgres or MySQL, the artefact store is S3 or equivalent, and operationally that is all there is.
Weights & Biases (W&B) is the commercial alternative, with a free tier for academic use. W&B has a meaningfully better UI, better experiment-comparison tools, better hyperparameter-sweep integration, and better media artefact handling (images, audio, video that you want to log alongside numeric metrics). It costs, in 2023, between $50 and $75 per user per month on the team tier. If you are doing vision research, the UI payoff is real. If you are doing tabular-data ML or most recommender work, the MLflow UI is sufficient.
Our recommendation: MLflow for most teams. Add W&B if your team does a lot of computer-vision research and you have budget.
Orchestration: Kubeflow, Metaflow, Airflow
Kubeflow (1.7 in late 2023) is the Kubernetes-native ML orchestration platform. It is powerful, it is the backbone of Google's internal ML stack, and it is the single most operationally expensive tool in this article. To run Kubeflow well you need someone on the team who understands Kubernetes deeply — not surface-level "I can deploy a container" understanding but "I can debug a CRD's finalizer" understanding. For teams that already have platform engineers, Kubeflow is excellent. For teams that do not, it is a tax.
Metaflow (originally from Netflix, open-sourced 2019, version 2.10 in late 2023) takes a radically different approach. The user writes a Python flow with decorators (@step, @resources, @conda). Metaflow handles the rest — pickling state, launching compute on AWS Batch or Kubernetes or a local machine, versioning artefacts. The learning curve is hours, not weeks. The operational footprint is a single Metadata service, an S3 bucket, and whatever compute backend you point it at.
Airflow (2.8 in late 2023) is the generalist. It is not built for ML specifically, but many teams use it for the batch parts of their pipeline (data prep, scheduled retraining) and pair it with a more specialised tool for the training step itself. Airflow is boring, battle-tested, and widely understood. That is worth a lot.
Our recommendation: Metaflow for small-to-mid teams that care about developer ergonomics. Airflow + a thin shim for teams that already run Airflow for data engineering. Kubeflow only if you have a platform team that wants it.
Managed platforms: SageMaker and Vertex AI
Amazon SageMaker is enormous. It has a training service, a hosting service, a feature store, a model registry, a Pipelines orchestrator, a monitoring product, a notebook environment, and about fifteen other sub-products we have lost track of. In 2023, SageMaker's main selling point is that it turns an MLOps problem into an IAM problem, which is not necessarily an improvement but is at least a familiar one. Cost is on-demand and can get out of hand fast — we have seen teams spend $8,000/month on a SageMaker endpoint that could have been a $400/month EC2 instance with BentoML.
Google Vertex AI is the GCP equivalent, better integrated with BigQuery and with a cleaner pipeline abstraction (the Vertex AI Pipelines product, which is a managed Kubeflow under the hood). Vertex AI's managed training is competitive with SageMaker's and somewhat easier to reason about. Its monitoring and feature-store offerings are thinner.
Our recommendation: If you are already deep on AWS and you have an ML team under five people, SageMaker saves you real infrastructure work. If you are on GCP, Vertex is competitive. If you are a cost-sensitive startup, avoid both and pay the operational cost of the open-source stack — you will break even inside two years.
Data versioning: DVC and friends
DVC (Data Version Control) is the established choice. It extends Git to handle large files and datasets via content-addressable storage on S3 or similar. In 2023, DVC is a required tool for reproducibility but a quiet one: the best outcome is that you never think about it after the initial setup.
DVC alternatives exist (LakeFS, Pachyderm, Delta Lake's time-travel feature, Weights & Biases Artifacts) but none has dethroned DVC for the "version a dataset alongside the code that trained on it" use case. LakeFS is arguably better for data-lake-scale versioning, but most training workflows do not need that.
Our recommendation: DVC, defaulted on. It is almost free to adopt.
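The mechanism behind DVC is simple enough to illustrate in a few lines. This is a toy sketch of content-addressable storage, not DVC's actual code (real DVC pointer files are YAML, and the cache sync to S3 is what `dvc push` does): the data file is hashed, cached under its hash, and a small pointer file is committed to Git in its place.

```python
# Toy illustration of DVC's content-addressable scheme (not DVC's code).
import hashlib
import json
import shutil
from pathlib import Path

def snapshot(data_path: str, cache_dir: str = ".cache") -> str:
    """Cache a file by content hash and write a Git-trackable pointer."""
    digest = hashlib.md5(Path(data_path).read_bytes()).hexdigest()
    dest = Path(cache_dir) / digest[:2] / digest[2:]   # cache keyed by hash
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_path, dest)
    pointer = Path(data_path + ".dvc")                 # the file Git tracks
    pointer.write_text(json.dumps({"md5": digest, "path": data_path}))
    return digest
```

Because the cache key is the content hash, re-snapshotting an unchanged dataset costs nothing, and checking out an old commit tells you exactly which bytes to fetch. That is the entire trick, and it is why DVC stays out of your way after setup.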
Monitoring: Evidently, WhyLabs, and the paid offerings
Evidently (open-source, version 0.4 in late 2023) is the pragmatic choice. It is a Python library that computes drift reports, data-quality checks, and model-performance summaries against labelled or unlabelled production data. It integrates with Airflow, Prefect, and Metaflow; the output is JSON or HTML, which you can wire into whatever dashboard you already have. No new server, no new UI to teach the team.
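Under the hood, a drift report reduces to statistical distances between a reference distribution and a production one. Here is one of the standard drift statistics, the Population Stability Index, sketched in plain Python; this is an illustration of the idea, not Evidently's implementation.

```python
# Population Stability Index between a reference sample and a production
# sample. A PSI above ~0.2 is a common alerting threshold. Plain-Python
# sketch of the statistic, not Evidently's code.
import math

def psi(reference, production, bins=10):
    lo = min(min(reference), min(production))
    hi = max(max(reference), max(production))
    width = (hi - lo) / bins or 1.0   # guard against a constant feature

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Floor each bin fraction to avoid log(0) on empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    r, p = frac(reference), frac(production)
    return sum((pi - ri) * math.log(pi / ri) for ri, pi in zip(r, p))
```

Identical samples score near zero; a shifted production distribution scores high. Evidently computes a battery of such checks per feature and packages the results as the JSON or HTML reports described above.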
WhyLabs is the commercial alternative that thinks about monitoring as a dedicated observability platform. You instrument your inference service with their whylogs library, ship profiles to their SaaS backend, and get a hosted UI with alerting. It is reasonable if you have the budget and want a dedicated ML observability product; in 2023 we have seen it land in a few mid-size companies that were tired of home-rolled monitoring.
Arize and Fiddler compete in the same space as WhyLabs with different flavours of emphasis — Arize on LLM observability, Fiddler on model explainability. Both are credible; neither has run away with the market.
Our recommendation: Evidently as the default. Graduate to a commercial product only when your on-call rotation is big enough that you need dedicated ML alerting infrastructure.
Model serving: BentoML, Seldon, Triton
BentoML (version 1.1 in late 2023) is the developer-ergonomic serving framework. You write a Python service class, decorate your inference function, and bentoml build packages the result into a deployable bundle that bentoml containerize turns into a Docker image with a sensible HTTP layer and proper concurrency handling. For most teams, BentoML is the right default.
Seldon Core is the Kubernetes-native alternative. If you are already on Kubeflow, Seldon slots in cleanly. If you are not, its operational overhead is substantial.
NVIDIA Triton Inference Server is the high-performance choice for GPU-bound serving. If your inference workload is large-scale deep-learning (vision, language, multimodal), Triton's dynamic batching, model ensembling, and multi-framework support (TensorRT, ONNX, PyTorch, TensorFlow) are genuinely ahead of the field. The downside is the configuration burden; Triton's config format is voluminous.
Our recommendation: BentoML for most teams. Triton when you are paying for enough GPU capacity that squeezing 30% more throughput out of it matters.
The decision matrix
| Tool             | Category        | License | Ops cost | Team-size sweet spot | Our rating      |
|------------------|-----------------|---------|----------|----------------------|-----------------|
| MLflow           | Tracking        | OSS     | Low      | 2–50 engineers       | Default         |
| Weights & Biases | Tracking        | Paid    | Very low | 5–200 (research)     | Worth it for CV |
| Metaflow         | Orchestration   | OSS     | Low      | 2–30                 | Default         |
| Kubeflow         | Orchestration   | OSS     | High     | 30+ (with platform)  | Conditional     |
| Airflow          | Orchestration   | OSS     | Medium   | Any (non-ML heavy)   | Adjacent        |
| SageMaker        | Platform        | Paid    | Low      | 2–20 on AWS          | Pragmatic       |
| Vertex AI        | Platform        | Paid    | Low      | 2–20 on GCP          | Pragmatic       |
| DVC              | Data versioning | OSS     | Very low | Any                  | Default         |
| Evidently        | Monitoring      | OSS     | Low      | 2–50                 | Default         |
| WhyLabs          | Monitoring      | Paid    | Low      | 20+                  | Conditional     |
| BentoML          | Serving         | OSS     | Low      | 2–50                 | Default         |
| Seldon Core      | Serving         | OSS     | High     | With Kubeflow        | Conditional     |
| Triton           | Serving         | OSS     | Medium   | GPU-heavy workloads  | Specific case   |
The opinionated stack
If you stop reading here, this is the stack we recommend for a typical five-to-fifteen-person ML team in late 2023:
- Experiment tracking: MLflow, self-hosted, Postgres + S3.
- Orchestration: Metaflow on AWS Batch (or the equivalent on GCP).
- Data versioning: DVC alongside your Git repo.
- Monitoring: Evidently, invoked from the orchestrator, results shipped to your existing Grafana or Datadog.
- Serving: BentoML, deployed as containers behind your existing load balancer.
This stack has a total operational footprint of roughly two half-time engineers to maintain well, assuming the underlying cloud infrastructure already exists. It is not the shiniest stack — none of these tools will get you invited to a conference talk. It is the stack that works at 10:00 on a Tuesday morning when a model needs to be retrained because a feature pipeline changed.
A worked configuration
To make the recommendation concrete, here is a minimal Metaflow flow that trains a model, logs to MLflow, and registers the output. It is almost embarrassingly short:
from metaflow import FlowSpec, step, Parameter, resources, current
import mlflow
import mlflow.sklearn


class HomeFeedTrainingFlow(FlowSpec):
    n_estimators = Parameter("n_estimators", default=200)
    mlflow_uri = Parameter("mlflow_uri", default="http://mlflow.internal:5000")

    @step
    def start(self):
        self.next(self.load_data)

    @step
    def load_data(self):
        import pandas as pd

        self.df = pd.read_parquet("s3://training/home-feed/2023-12-11.parquet")
        self.next(self.train)

    @resources(cpu=16, memory=64000)
    @step
    def train(self):
        from sklearn.ensemble import GradientBoostingClassifier

        mlflow.set_tracking_uri(self.mlflow_uri)
        mlflow.set_experiment("home-feed-weekly")
        X = self.df.drop(columns=["label"])
        y = self.df["label"]
        # current.run_id is Metaflow's id for this run; note that current
        # is imported from metaflow, not accessed through self.
        with mlflow.start_run(run_name=current.run_id) as run:
            model = GradientBoostingClassifier(n_estimators=self.n_estimators)
            model.fit(X, y)
            mlflow.log_param("n_estimators", self.n_estimators)
            mlflow.log_metric("train_acc", model.score(X, y))
            mlflow.sklearn.log_model(
                model, "model", registered_model_name="home-feed"
            )
            self.mlflow_run_id = run.info.run_id
        self.next(self.end)

    @step
    def end(self):
        print(f"Training complete. MLflow run: {self.mlflow_run_id}")


if __name__ == "__main__":
    HomeFeedTrainingFlow()
Ten lines of actual logic, run with python flow.py run --with batch to execute on AWS Batch, or with python flow.py run to execute locally. The Metaflow UI records every run, the MLflow registry holds every trained model, and the data snapshot is versioned by the Parquet path. That is 80% of what an "MLOps stack" does.
What this article deliberately does not cover
Feature stores (Feast, Tecton, Featureform) are worth an article of their own and we are writing one separately. Vector databases (Pinecone, Weaviate, Qdrant) are adjacent to MLOps but properly belong in a different category. And the emerging LLMOps tooling (LangSmith, TruLens, and the first wave of LLM-specific observability products) is changing fast enough that we did not want to put it in a year-end article that would be obsolete by February.
Closing note
The honest summary, looking across these ten tools in late 2023: the MLOps landscape is finally boring, which is a compliment. Three years ago, every team was building its own experiment-tracking database and its own feature store out of Redis. Today, the defaults work. A team that adopts MLflow + Metaflow + Evidently + BentoML + DVC will have a better MLOps stack than most Fortune 500 companies had in 2020, and the total adoption cost is measured in weeks, not quarters.
The real remaining questions are not about tools but about the practices around them: how to structure a feature review, how to run an ML incident post-mortem, how to decide when a model is "done" and ready for deployment. Those are the subjects we plan to cover in 2024.
Reviewed for technical accuracy by Dr. Nadia Volkov. The opinions here are the author's; MLSR has no financial relationship with any of the tools listed.