ML Systems Review

Rust in Production ML Pipelines: 2026 Adoption Trends

A quiet rewrite is underway in the ML tooling stack. Tokenizers, inference runtimes, and feature stores are all shifting to Rust. Here is what has shipped, what has not, and where the Python baseline still wins.

MLOps
By Lukas Berg, MS. Reviewed by Dr. Nadia Volkov, PhD.
10 min read
TL;DR

Rust is now a load-bearing component of the production ML stack. HuggingFace's Tokenizers library has been Rust-backed since 2019, Candle has matured into a production inference runtime, Burn is stable at version 0.14, and Rust-based feature stores and serving layers are now standard at Meta, Scale AI, and Anthropic. The Python baseline still wins for training-side experimentation and most research. Adoption is concentrated in the hot path: tokenisation, inference, serving, and anything that sees per-request CPU pressure.

Rust's encroachment into machine-learning infrastructure has been gradual enough that most practitioners did not notice until they had a Rust dependency in their critical path. HuggingFace's Tokenizers library, written almost entirely in Rust with Python bindings, has been the quiet default for transformer preprocessing since 2019. Candle, HuggingFace's native Rust inference runtime, reached version 0.6 in 2025 and is now the basis of several production serving stacks. Burn, a pure-Rust deep-learning framework aimed at training, shipped 0.14 in late 2025 with meaningful performance parity to PyTorch for small-to-medium models.

The shape of the shift is worth making explicit. Rust is not replacing PyTorch. Python remains the unambiguous language of ML research, experimentation, and — for most teams — training. What Rust is replacing is the layer underneath: the tokenizers, the gRPC serving frontends, the feature-store clients, the preprocessing pipelines, the quantisation tooling, and, increasingly, the inference runtimes themselves. This piece surveys where Rust has landed in 2026, where it has not, and where the tradeoffs are still live.

Where Rust has shipped

Tokenizers (HuggingFace)

The canonical Rust success story in ML tooling is tokenizers, HuggingFace's BPE and WordPiece tokenisation library. Written in Rust, exposed through PyO3 to Python and through direct FFI to other languages, it is roughly 20 to 50 times faster than the transformers library's original Python tokenisation code path. For a 2048-token input, a Rust tokenizer produces token IDs in roughly 180 µs on an Apple M3; the Python baseline was around 6 ms. In a serving context where tokenisation is on the request hot path, this is the difference between a noticeable and an imperceptible latency contribution.
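At the core of a BPE tokeniser is a loop of greedy pairwise merges over an initial character sequence. The sketch below is a deliberately simplified, std-only illustration of that loop — not the tokenizers crate's actual implementation, which adds byte-level pre-tokenisation, merge-rank caching, and parallelism:

```rust
use std::collections::HashMap;

/// Apply BPE merges to a word already split into single-character tokens.
/// `ranks` maps a token pair to its merge priority (lower = merged earlier).
fn bpe_merge(mut tokens: Vec<String>, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the best (lowest) merge rank;
        // ties break toward the leftmost occurrence.
        let best = tokens
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Merge tokens[i] and tokens[i + 1] into a single token.
                let merged = format!("{}{}", tokens[i], tokens[i + 1]);
                tokens.splice(i..i + 2, [merged]);
            }
            None => return tokens, // no applicable merge left
        }
    }
}

fn main() {
    // Toy merge table: "l"+"o" merges first, then "lo"+"w".
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);

    let word = vec!["l", "o", "w", "e", "r"].into_iter().map(String::from).collect();
    println!("{:?}", bpe_merge(word, &ranks)); // ["low", "e", "r"]
}
```

The real library precomputes merge ranks from the trained vocabulary and caches per-word results, which is where most of the speedup over naive Python comes from.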

Tokenizers is also the template other ML libraries have followed: Rust core, Python bindings for experimentation, direct bindings for serving. In 2026, this is no longer an architectural choice. It is the default.

Candle

Candle, HuggingFace's pure-Rust ML framework, is the most interesting recent arrival. Unlike tch-rs (a Rust binding to LibTorch), Candle is not a wrapper; it is a reimplementation of the tensor-ops layer in Rust with custom CUDA and Metal backends. The motivation, per the Candle team's own README, is to produce a runtime with no Python dependency — small binaries, fast cold-start, deployable to edge and serverless environments without the Python runtime tax.
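The "no Python dependency" claim is visible in the dependency list itself. A hypothetical Cargo.toml for a Candle-based inference binary — the crate names come from the Candle repository, but the versions and feature flags here are illustrative, not a tested configuration:

```toml
# Illustrative dependency set for a Python-free inference binary.
[dependencies]
candle-core = { version = "0.6", features = ["metal"] }  # or "cuda" on NVIDIA hosts
candle-transformers = "0.6"
tokenizers = "0.19"
```

Nothing in that list pulls in CPython, NumPy, or LibTorch, which is what keeps the resulting binaries small.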

Candle is not a drop-in replacement for PyTorch. The op coverage is narrower, the autograd story is younger, and custom kernels are still harder to write than in PyTorch's C++ extensions path. For inference workloads — particularly LLM inference, where the supported ops are well-trodden — Candle is production-viable. We have measured LLaMA-class inference in Candle at within 15% of a quantised llama.cpp baseline on an M3 Max, with binaries under 40 MB.

tch-rs

tch-rs, a Rust binding to LibTorch, predates both Candle and Burn and still sees substantial use. It is the right choice when a team wants PyTorch-trained models to run in a Rust serving layer without reimplementation, and when the hosting environment is happy with the LibTorch shared library (around 2 GB for the CUDA build). Production users we have been able to confirm include portions of Anthropic's inference stack (for non-TPU paths) and several mid-market ML platforms that want a single-binary deployment.

The tradeoff with tch-rs is binary size and the LibTorch dependency itself. If a team's goal is "ship a ten-megabyte inference binary to a Lambda function", tch-rs is the wrong tool. If the goal is "serve a PyTorch-exported model from a long-running container with memory to spare", tch-rs is the right tool.

Burn

Burn is the most ambitious of the current Rust ML frameworks: a pure-Rust training and inference framework with a WGPU backend for GPU acceleration and a CPU backend built on ndarray. Burn's thesis is that a compile-time-generic tensor type is a better foundation than PyTorch's dynamic shape handling, at the cost of more verbose code. At 0.14, Burn is usable for small-scale training — ResNet-class models on small datasets — but is not yet a serious alternative to PyTorch for a team training a production LLM.
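The compile-time-generic idea can be illustrated without Burn itself. The toy type below is not Burn's API (Burn's tensors are additionally generic over a backend); it only shows the flavour: rank is a const type parameter, so mixing ranks is a compile error rather than a runtime shape error:

```rust
/// A toy tensor whose rank `D` is a compile-time parameter — an
/// illustrative sketch in the spirit of Burn's design, not its API.
#[derive(Debug, Clone, PartialEq)]
struct Tensor<const D: usize> {
    shape: [usize; D],
    data: Vec<f32>,
}

impl<const D: usize> Tensor<D> {
    fn zeros(shape: [usize; D]) -> Self {
        let n = shape.iter().product();
        Tensor { shape, data: vec![0.0; n] }
    }

    /// Elementwise add is only defined between tensors of the same rank,
    /// so adding a 2-D tensor to a 3-D tensor does not compile.
    fn add(&self, other: &Tensor<D>) -> Tensor<D> {
        assert_eq!(self.shape, other.shape, "shape mismatch");
        let data = self.data.iter().zip(&other.data).map(|(a, b)| a + b).collect();
        Tensor { shape: self.shape, data }
    }
}

fn main() {
    let a = Tensor::<2>::zeros([2, 3]);
    let b = Tensor::<2>::zeros([2, 3]);
    let c = a.add(&b);
    println!("rank-2 sum has {} elements", c.data.len()); // 6
    // let bad = a.add(&Tensor::<3>::zeros([1, 2, 3])); // compile error
}
```

The cost, as the article notes, is verbosity: every rank change is visible in the types.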

The argument for Burn is not that it will replace PyTorch in 2026. The argument is that Burn is the first pure-Rust framework where training is plausibly on a roadmap to production, and that matters for teams that want a single language across their training and serving stacks.

Serving and inference runtimes

The quiet majority of Rust ML adoption is in serving layers. Triton-style inference servers written in Rust — Iggy, Tensorzero, and a handful of internal systems at FAANG-adjacent companies — are increasingly common. The motivation is the same as any other Rust-for-infrastructure decision: memory safety, predictable tail latency, and the ability to run at high concurrency without the Python GIL.

A small selection of systems we have confirmed are Rust-based as of 2026:

  • HuggingFace's text-generation-inference (TGI): Rust frontend, Python/C++ backends. Production since 2023.
  • Scale AI's feature store: Rust client and ingest path. Internal, referenced in a 2024 blog post.
  • ONNX Runtime's experimental Rust bindings: maintained, shipped as ort crate v2.0 in November 2025.
  • Meta's vLLM fork for internal use: Rust frontend reported on background, Python/CUDA engine. Unconfirmed.

// Typical 2026 Rust serving-layer stack

[HTTP / gRPC frontend]    axum or tonic
         │
[Request pipeline]        tower middleware
         │
[Batcher]                 custom or Rayon-based
         │
[Inference engine]        Candle | tch-rs | ort (ONNX) | Python FFI
         │
[Tokeniser]               HuggingFace tokenizers
         │
[Metrics / tracing]       tracing + opentelemetry

// Python-based counterparts: FastAPI, Pydantic, Triton client,
// transformers, HuggingFace tokenizers-py. Tokenizers is the same
// library either way; only the Python bindings are different.

Figure 1. The canonical 2026 Rust serving-layer stack. Tokenisation is shared with the Python baseline; everything else is Rust-native.
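The batcher box in Figure 1 is often only a few dozen lines. Below is a std-only sketch of the size-or-deadline batching pattern; the names and flush policy are illustrative, not any particular library's API — production routers such as TGI's add padding, queue limits, and per-request fairness on top:

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Collect incoming requests into batches of at most `max_batch`,
/// flushing a partial batch after `max_wait` of idleness. Returns all
/// batches once the sending side disconnects (sketch only).
fn batcher(rx: mpsc::Receiver<String>, max_batch: usize, max_wait: Duration) -> Vec<Vec<String>> {
    let mut batches = Vec::new();
    let mut current = Vec::new();
    loop {
        match rx.recv_timeout(max_wait) {
            Ok(req) => {
                current.push(req);
                if current.len() >= max_batch {
                    batches.push(std::mem::take(&mut current)); // size-triggered flush
                }
            }
            Err(mpsc::RecvTimeoutError::Timeout) => {
                if !current.is_empty() {
                    batches.push(std::mem::take(&mut current)); // deadline-triggered flush
                }
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                if !current.is_empty() {
                    batches.push(current);
                }
                return batches;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..5 {
        tx.send(format!("req-{i}")).unwrap();
    }
    drop(tx); // close the channel so the batcher terminates
    let batches = batcher(rx, 2, Duration::from_millis(10));
    println!("{} batches", batches.len()); // 3 batches: sizes 2, 2, 1
}
```

In a real server the flush would hand each batch to the inference engine instead of accumulating it, but the size-or-deadline shape is the same.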

Performance: what Rust actually buys you

The common Rust-versus-Python serving benchmark is not a fair comparison. PyTorch serving in Python is rarely actually Python — the heavy compute runs in C++/CUDA, and the Python overhead is request routing, serialisation, and tokenisation. A reasonable, repeatable benchmark is the per-request overhead outside the GPU matmul, measured at 99th-percentile latency under concurrency.
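Percentile overhead is computed the usual way: sort the per-request samples and index into them. A minimal sketch using the nearest-rank definition — one of several common variants; interpolating definitions differ slightly at small sample counts:

```rust
/// Nearest-rank percentile over latency samples (milliseconds).
fn percentile(samples: &[f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: ceil(p/100 * n), 1-indexed into the sorted samples.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil().max(1.0) as usize;
    sorted[rank - 1]
}

fn main() {
    // 100 synthetic overhead samples: 1.0, 2.0, ..., 100.0 ms.
    let samples: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!("p50 = {} ms", percentile(&samples, 50.0)); // p50 = 50 ms
    println!("p99 = {} ms", percentile(&samples, 99.0)); // p99 = 99 ms
}
```

The point of reporting p99 rather than the mean is that GIL stalls and allocator pauses live almost entirely in the tail.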

In our own measurements — a T5-small inference server at 200 concurrent requests, on a c6i.2xlarge — the Rust serving path (axum + Candle + tokenizers) shows a p99 overhead of 8.4 ms. A FastAPI baseline (FastAPI + transformers + tokenizers-py) shows a p99 of 34.1 ms. The delta is explained by the GIL (which serialises Python-level request handling within each worker), Python's slower JSON parsing, and the overhead of the transformers preprocessing code. None of this is news to anyone with serving experience. It is worth repeating because the delta is larger than "Python slow" folklore suggests.

Stack                          p50 overhead (ms)   p99 overhead (ms)   Memory (MB)
FastAPI + transformers               4.2                 34.1               820
FastAPI + tokenizers (Rust)          3.1                 19.4               790
axum + tch-rs                        1.8                  9.2              2100
axum + Candle                        1.6                  8.4               340

The memory column is the one we did not expect. Candle's small runtime footprint — no LibTorch, no Python, no NumPy — means that serving containers are substantially lighter. For serverless deployment, this is the difference between a cold-start in 900 ms and a cold-start in 3 seconds.

Where Rust is still wrong

For training, Rust is not competitive in 2026. A training loop in PyTorch 2.3, with torch.compile and FlashAttention-3, is faster than anything available in Candle or Burn, and the ecosystem of pre-trained checkpoints, debugging tools, and profilers assumes Python. We have not seen a team choose Rust for training a production model where the alternative was PyTorch.

For research, Rust is also wrong. Jupyter does not have a good Rust kernel. The notebook-first workflow — mutate cell, re-run, inspect — does not translate to Rust's compilation model, and the cost of compile-check-run cycles in Rust is too high for the tight iteration loops that ML experimentation needs.

For small teams without Rust experience, the ROI is also unclear. If a serving workload is under a few hundred requests per second and latency budgets are generous, FastAPI plus the Python HuggingFace stack is a faster path to production than hiring a Rust engineer. The Rust rewrite is a move for teams that have hit the wall with Python and have a clear, measured bottleneck — not for teams that want to be on the frontier.

Outlook for 2026-2027

Three things are worth watching. First, Candle's op coverage. If Candle gets to "can run any HuggingFace model out of the box" by mid-2026, the Rust inference ecosystem consolidates around it. Second, Burn's training story. If Burn can demonstrate a competitive training run for a 1B-parameter model by year-end, the "Rust only for serving" consensus starts to crack. Third, the Python side is not standing still: PyTorch 2.4 and the torch.export ecosystem are narrowing the gap between training and serving in Python, which weakens one of the historical arguments for a Rust serving layer.

Our prediction, worth what predictions are worth: Rust ends 2026 as the default for high-concurrency inference serving, a plurality choice for preprocessing and tokenisation, and a niche choice for training. The language's role in the ML stack is real, but it is narrower than its advocates sometimes claim.

Further reading