ML Systems Review

The MLOps Stack of 2023: What's Worth Adopting

Ten tools, five categories, one opinionated recommendation per category. The honest version — including which of these are worth the operational cost and which you can skip if your team is small.

MLOps
By Lukas Berg, MS. Reviewed by Dr. Nadia Volkov, PhD.
10 min read
TL;DR

The MLOps tool landscape in late 2023 has consolidated around roughly ten serious choices spread across five core categories: experiment tracking (MLflow, Weights & Biases), orchestration (Kubeflow, Metaflow), data versioning (DVC), monitoring (Evidently, WhyLabs), and serving (BentoML), plus the cross-cutting managed platforms (SageMaker, Vertex AI). Small teams should pick one per category and resist the urge to run the full stack. For most teams of fewer than ten engineers, the pragmatic default is MLflow + Metaflow + Evidently + BentoML, run on whatever cloud you already pay for.

Updated 2026: Looking back from today, two of the tools on this list have followed dramatically different trajectories from the ones we predicted. Weights & Biases has become the de facto standard for LLM fine-tuning workflows, driven by its Weave subproduct. Kubeflow, meanwhile, has lost ground to Dagster, Prefect 3, and Flyte. The rest of the matrix has aged well.

It is December 2023, and every ML team we speak to asks us some variant of the same question: what tools should we use? The honest answer, that it depends on the size of the team and the shape of the workload, does not satisfy anyone. So in this article we will be more specific: we will name ten tools, map them against five categories, and give an opinionated recommendation at the end. These are our views at ML Systems Review, based on operational experience with most of these tools and on interviews with engineers at six companies running production ML pipelines.

The caveats, in order of importance: (1) none of these tools are bad; they are differently good. (2) Integration cost between tools is almost always higher than the sticker price of the tools themselves. (3) If you are a team of one or two, you do not need most of this.

The five categories

We divide the 2023 MLOps landscape into five operational concerns. A serious ML team needs a tool — not necessarily a separate tool — for each:

  1. Experiment tracking and model registry (MLflow, Weights & Biases). Where does the record of "what did we train, on what data, with what hyperparameters, and how did it do" live?
  2. Pipeline orchestration (Kubeflow, Metaflow, plus the managed platforms). How do training and batch-inference jobs get kicked off, scheduled, and retried?
  3. End-to-end managed platforms (Amazon SageMaker, Google Vertex AI). The cloud-vendor answer to "do you really want to run your own MLOps stack?"
  4. Data and artefact versioning (DVC, LakeFS-adjacent tools). How do you reproduce a training run from three months ago when someone asks?
  5. Monitoring and drift detection (Evidently, WhyLabs, Arize, Fiddler). How do you know when the production model has stopped working?
  6. Model serving (BentoML, Seldon Core, Triton). How does a trained model artefact become an HTTP endpoint?

Yes, that is six categories. We merged "managed platforms" into the matrix because they cut across most of the others. On to the tools.

Experiment tracking: MLflow vs Weights & Biases

MLflow (version 2.9 at time of writing) is the open-source default. It has a tracking server, a model registry, a simple Python SDK, and a UI that was described to us by one engineer as "good enough that you forget it is free". In 2023, MLflow is included out of the box in Databricks, is a first-class option in every major cloud ML platform, and has a production story that works for teams in the one-to-fifty-engineer range without significant surgery. The tracking backend is Postgres or MySQL, the artefact store is S3 or equivalent, and operationally that is all there is.

Weights & Biases (W&B) is the commercial alternative, with a free tier for academic use. W&B has meaningfully better UI, better experiment-comparison tools, better hyperparameter-sweep integration, and better media artefact handling (images, audio, video that you want to log alongside numeric metrics). It costs, in 2023, between $50 and $75 per user per month on the team tier. If you are doing vision research, the UI payoff is real. If you are doing tabular-data ML or most recommender work, the MLflow UI is sufficient.

Our recommendation: MLflow for most teams. Add W&B if your team does a lot of computer-vision research and you have budget.

Orchestration: Kubeflow, Metaflow, Airflow

Kubeflow (1.7 in late 2023) is the Kubernetes-native ML orchestration platform. It is powerful, it is the backbone of Google's internal ML stack, and it is the single most operationally expensive tool in this article. To run Kubeflow well you need someone on the team who understands Kubernetes deeply — not surface-level "I can deploy a container" understanding but "I can debug a CRD's finalizer" understanding. For teams that already have platform engineers, Kubeflow is excellent. For teams that do not, it is a tax.

Metaflow (originally from Netflix, open-sourced 2019, version 2.10 in late 2023) takes a radically different approach. The user writes a Python flow with decorators (@step, @resources, @conda). Metaflow handles the rest — pickling state, launching compute on AWS Batch or Kubernetes or a local machine, versioning artefacts. The learning curve is hours, not weeks. The operational footprint is a single Metadata service, an S3 bucket, and whatever compute backend you point it at.

Airflow (2.8 in late 2023) is the generalist. It is not built for ML specifically, but many teams use it for the batch parts of their pipeline (data prep, scheduled retraining) and pair it with a more specialised tool for the training step itself. Airflow is boring, battle-tested, and widely understood. That is worth a lot.

Our recommendation: Metaflow for small-to-mid teams that care about developer ergonomics. Airflow + a thin shim for teams that already run Airflow for data engineering. Kubeflow only if you have a platform team that wants it.

Managed platforms: SageMaker and Vertex AI

Amazon SageMaker is enormous. It has a training service, a hosting service, a feature store, a model registry, a Pipelines orchestrator, a monitoring product, a notebook environment, and about fifteen other sub-products we have lost track of. In 2023, SageMaker's main selling point is that it turns an MLOps problem into an IAM problem, which is not necessarily an improvement but is at least a familiar one. Cost is on-demand and can get out of hand fast — we have seen teams spend $8,000/month on a SageMaker endpoint that could have been a $400/month EC2 instance with BentoML.

Google Vertex AI is the GCP equivalent, better integrated with BigQuery and with a cleaner pipeline abstraction (the Vertex AI Pipelines product, which is a managed Kubeflow under the hood). Vertex AI's managed training is competitive with SageMaker's and somewhat easier to reason about. Its monitoring and feature-store offerings are thinner.

Our recommendation: If you are already deep on AWS and you have an ML team under five people, SageMaker saves you real infrastructure work. If you are on GCP, Vertex is competitive. If you are a cost-sensitive startup, avoid both and pay the operational cost of the open-source stack — you will break even inside two years.
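To make the break-even claim concrete, here is the arithmetic as a sketch. All figures are hypothetical, loosely inspired by the endpoint comparison above; they are not from a real bill, and the engineer-rate assumption is ours alone.

```python
# Hypothetical monthly costs for a managed endpoint vs. a self-hosted
# stack; none of these figures come from a real invoice.
managed_monthly = 8_000           # always-on managed inference endpoint
self_hosted_monthly = 400         # equivalent instance + OSS serving layer
migration_cost = 2 * 160 * 120    # ~2 engineer-months at $120/hr, one-off

savings_per_month = managed_monthly - self_hosted_monthly
breakeven_months = migration_cost / savings_per_month
print(f"Break-even after {breakeven_months:.1f} months")  # ~5 months here
```

With a less lopsided cost gap the picture changes quickly, which is why the recommendation above is conditioned on being cost-sensitive.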

Data versioning: DVC and friends

DVC (Data Version Control) is the established choice. It extends Git to handle large files and datasets via content-addressable storage on S3 or similar. In 2023, DVC is a required tool for reproducibility but a quiet one: the best outcome is that you never think about it after the initial setup.
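The mechanism is worth understanding even if you never look at it again: DVC replaces the large file in your repository with a small pointer file and stores the content in a cache keyed by its hash (MD5 for classic DVC files, with the first two hex characters used as a directory shard). A stdlib-only sketch of that addressing scheme, not DVC's actual code:

```python
import hashlib
from pathlib import Path

def cache_path(data: bytes, cache_dir: str = ".dvc/cache") -> Path:
    # Content-addressable layout: hash the bytes, shard by the first
    # two hex characters, use the remaining characters as the filename.
    digest = hashlib.md5(data).hexdigest()
    return Path(cache_dir) / digest[:2] / digest[2:]

snapshot = b"user_id,label\n1,0\n2,1\n"
print(cache_path(snapshot))
# Identical content always maps to the same path, so re-adding an
# unchanged dataset costs nothing; changed content gets a new path.
```

The payoff is the property the article relies on: the pointer file committed to Git pins exactly one cache entry, so checking out a three-month-old commit recovers the exact bytes that training run saw.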

DVC alternatives exist (LakeFS, Pachyderm, Delta Lake's time-travel feature, Weights & Biases Artifacts) but none has dethroned DVC for the "version a dataset alongside the code that trained on it" use case. LakeFS is arguably better for data-lake-scale versioning, but most training workflows do not need that.

Our recommendation: DVC, on by default. It costs almost nothing to adopt.

Monitoring: Evidently, WhyLabs, and the paid offerings

Evidently (open-source, version 0.4 in late 2023) is the pragmatic choice. It is a Python library that computes drift reports, data-quality checks, and model-performance summaries against labelled or unlabelled production data. It integrates with Airflow, Prefect, and Metaflow; the output is JSON or HTML, which you can wire into whatever dashboard you already have. No new server, no new UI to teach the team.
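For intuition about what a drift report actually computes, here is a stdlib-only sketch of one classic drift statistic, the population stability index (PSI). Evidently's presets run a battery of such tests (which ones depends on the data type and sample size), so treat this as a toy illustration rather than its implementation:

```python
import math
from collections import Counter

def psi(reference, current, bins=10):
    """Population stability index between two numeric samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 drifted."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def histogram(sample):
        # Bin each value on the reference range, clamping out-of-range values.
        counts = Counter(
            min(max(int((x - lo) / width), 0), bins - 1) for x in sample)
        # Add-one smoothing so empty bins do not blow up the log ratio.
        return [(counts.get(b, 0) + 1) / (len(sample) + bins)
                for b in range(bins)]

    ref_h, cur_h = histogram(reference), histogram(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_h, cur_h))

reference = [x / 100 for x in range(1000)]    # uniform on [0, 10)
shifted = [x / 100 + 3 for x in range(1000)]  # same shape, shifted right
print(f"self:    {psi(reference, reference):.4f}")  # identical data: 0.0
print(f"shifted: {psi(reference, shifted):.4f}")    # well above 0.25
```

The operational point stands regardless of the statistic: this is a pure function over two samples, which is why it slots into an orchestrator step instead of requiring a new server.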

WhyLabs is the commercial alternative that thinks about monitoring as a dedicated observability platform. You instrument your inference service with their whylogs library, ship profiles to their SaaS backend, and get a hosted UI with alerting. It is reasonable if you have the budget and want a dedicated ML observability product; in 2023 we have seen it land in a few mid-size companies that were tired of home-rolled monitoring.

Arize and Fiddler compete in the same space as WhyLabs with different flavours of emphasis — Arize on LLM observability, Fiddler on model explainability. Both are credible; neither has run away with the market.

Our recommendation: Evidently as the default. Graduate to a commercial product only when your on-call rotation is big enough that you need dedicated ML alerting infrastructure.

Model serving: BentoML, Seldon, Triton

BentoML (version 1.1 in late 2023) is the developer-ergonomic serving framework. You write a Python service class, decorate your inference function, and bentoml build produces a Docker image with a sensible FastAPI wrapper and proper concurrency handling. For most teams, BentoML is the right default.

Seldon Core is the Kubernetes-native alternative. If you are already on Kubeflow, Seldon slots in cleanly. If you are not, its operational overhead is substantial.

NVIDIA Triton Inference Server is the high-performance choice for GPU-bound serving. If your inference workload is large-scale deep-learning (vision, language, multimodal), Triton's dynamic batching, model ensembling, and multi-framework support (TensorRT, ONNX, PyTorch, TensorFlow) are genuinely ahead of the field. The downside is the configuration burden; Triton's config format is voluminous.
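Dynamic batching is conceptually simple even though Triton's implementation is not: hold incoming requests in a queue for at most a few milliseconds, then run whatever has accumulated as one batch. A toy, framework-free sketch of the policy (this is our illustration, not Triton code; the parameter names are ours):

```python
import queue
import threading
import time

def micro_batcher(requests, run_batch, max_batch_size=8, max_delay_s=0.005):
    """Drain a request queue into batches: flush when the batch is full
    or when the oldest queued request has waited max_delay_s."""
    while True:
        batch = [requests.get()]            # block until work arrives
        deadline = time.monotonic() + max_delay_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_batch(batch)

# Demo: feed 20 requests, record the batch sizes the policy produces.
sizes = []
q = queue.Queue()
threading.Thread(
    target=micro_batcher, args=(q, lambda b: sizes.append(len(b))),
    daemon=True).start()
for i in range(20):
    q.put(i)
time.sleep(0.3)
print(sizes)  # every batch is at most 8; the sizes sum to 20
```

The trade this makes explicit is latency for throughput: each request waits up to max_delay_s in exchange for the GPU seeing fuller batches, which is exactly the knob Triton's configuration exposes at much greater length.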

Our recommendation: BentoML for most teams. Triton when you are paying for enough GPU capacity that squeezing 30% more throughput out of it matters.

The decision matrix

Table 1. Decision matrix, late 2023.
  Tool                Category           License   Ops cost   Team size sweet spot   Our rating
  ----------------    ---------------    -------   --------   --------------------   -----------
  MLflow              Tracking           OSS       Low        2–50 engineers         Default
  Weights & Biases    Tracking           Paid      Very low   5–200 (research)       Worth it for CV
  Metaflow            Orchestration      OSS       Low        2–30                   Default
  Kubeflow            Orchestration      OSS       High       30+ (with platform)    Conditional
  Airflow             Orchestration      OSS       Medium     Any (non-ML heavy)     Adjacent
  SageMaker           Platform           Paid      Low        2–20 on AWS            Pragmatic
  Vertex AI           Platform           Paid      Low        2–20 on GCP            Pragmatic
  DVC                 Data versioning    OSS       Very low   Any                    Default
  Evidently           Monitoring         OSS       Low        2–50                   Default
  WhyLabs             Monitoring         Paid      Low        20+                    Conditional
  BentoML             Serving            OSS       Low        2–50                   Default
  Seldon Core         Serving            OSS       High       With Kubeflow          Conditional
  Triton              Serving            OSS       Medium     GPU-heavy workloads    Specific case

The opinionated stack

If you stop reading here, this is the stack we recommend for a typical five-to-fifteen-person ML team in late 2023:

  • Experiment tracking: MLflow, self-hosted, Postgres + S3.
  • Orchestration: Metaflow on AWS Batch (or the equivalent on GCP).
  • Data versioning: DVC alongside your Git repo.
  • Monitoring: Evidently, invoked from the orchestrator, results shipped to your existing Grafana or Datadog.
  • Serving: BentoML, deployed as containers behind your existing load balancer.

This stack has a total operational footprint of roughly two half-time engineers to maintain well, assuming the underlying cloud infrastructure already exists. It is not the shiniest stack — none of these tools will get you invited to a conference talk. It is the stack that works at 10:00 on a Tuesday morning when a model needs to be retrained because a feature pipeline changed.

A worked configuration

To make the recommendation concrete, here is a minimal Metaflow flow that trains a model, logs to MLflow, and registers the output. It is almost embarrassingly short:

from metaflow import FlowSpec, step, Parameter, resources, current
import mlflow
import mlflow.sklearn

class HomeFeedTrainingFlow(FlowSpec):
    n_estimators = Parameter("n_estimators", default=200)
    mlflow_uri = Parameter("mlflow_uri", default="http://mlflow.internal:5000")

    @step
    def start(self):
        self.next(self.load_data)

    @step
    def load_data(self):
        import pandas as pd
        # The dated Parquet path doubles as a crude data snapshot.
        self.df = pd.read_parquet("s3://training/home-feed/2023-12-11.parquet")
        self.next(self.train)

    @resources(cpu=16, memory=64000)
    @step
    def train(self):
        from sklearn.ensemble import GradientBoostingClassifier
        mlflow.set_tracking_uri(self.mlflow_uri)
        mlflow.set_experiment("home-feed-weekly")

        X = self.df.drop(columns=["label"])
        y = self.df["label"]

        # Name the MLflow run after the Metaflow run so the two systems
        # cross-reference; `current` is Metaflow's run-context singleton.
        with mlflow.start_run(run_name=current.run_id) as run:
            model = GradientBoostingClassifier(n_estimators=self.n_estimators)
            model.fit(X, y)

            mlflow.log_param("n_estimators", self.n_estimators)
            mlflow.log_metric("train_acc", model.score(X, y))
            mlflow.sklearn.log_model(model, "model",
                registered_model_name="home-feed")

            self.mlflow_run_id = run.info.run_id
        self.next(self.end)

    @step
    def end(self):
        print(f"Training complete. MLflow run: {self.mlflow_run_id}")

if __name__ == "__main__":
    HomeFeedTrainingFlow()

Ten lines of actual logic. Run python flow.py run --with batch to execute on AWS Batch, or python flow.py run to execute locally. The Metaflow UI records every run, the MLflow registry holds every trained model, and the data snapshot is versioned by the Parquet path. That is 80% of what an "MLOps stack" does.

What this article deliberately does not cover

Feature stores (Feast, Tecton, Featureform) are worth an article of their own and we are writing one separately. Vector databases (Pinecone, Weaviate, Qdrant) are adjacent to MLOps but properly belong in a different category. And the emerging LLMOps tooling (LangSmith, TruLens, and the first wave of LLM-specific observability products) is changing fast enough that we did not want to put it in a year-end article that would be obsolete by February.

Closing note

The honest summary, looking across these ten tools in late 2023: the MLOps landscape is finally boring, which is a compliment. Three years ago, every team was building their own experiment-tracking database and their own feature-store from Redis. Today, the defaults work. A team that adopts MLflow + Metaflow + Evidently + BentoML + DVC will have a better MLOps stack than most Fortune 500 companies had in 2020, and the total adoption cost is measured in weeks, not quarters.

The real remaining questions are not about tools but about the practices around them: how to structure a feature review, how to run an ML incident post-mortem, how to decide when a model is "done" and ready for deployment. Those are the subjects we plan to cover in 2024.

Reviewed for technical accuracy by Dr. Nadia Volkov. The opinions here are the author's; MLSR has no financial relationship with any of the tools listed.