What the 2026 llama.cpp Rewrite Actually Changed
Georgi Gerganov's llama.cpp has landed its largest architectural rewrite since the project's original 2023 release; the change was merged to main in early April 2026. The rewrite touches three layers at once: a new kernel generator that replaces most of the hand-rolled SIMD intrinsics, a reorganised KV cache that is now contiguous per attention head, and a unified backend dispatch that collapses the previously parallel Metal and CUDA code paths into a single op-graph lowering pass. The practical effect is a 2.1x end-to-end throughput improvement on 70B quantised models and roughly 1.4x on 7B, measured on an Apple M3 Ultra and an Nvidia H100 respectively.
The rewrite is notable not just for the numbers but because of how it was executed. llama.cpp has always prized its small surface area — a single-header C library, minimal dependencies, and code paths that a mid-career systems engineer can read on a plane. The new version preserves that philosophy. The backend unification removes roughly 11,000 lines of duplicated intrinsics without introducing a heavy IR layer, which is the sort of thing that goes uncommented in most commit histories but is the actual feat of engineering here.
The kernel generator
The old llama.cpp was a repository of hand-written kernels: AVX2, AVX-512, NEON, Metal MSL, CUDA, ROCm HIP, and an increasingly long tail of conditionally compiled variants for specific operator/precision pairs. The new kernel generator replaces most of these with a compact template layer that emits per-backend code from a shared operator description. The generator is written in plain C and produces output at build time; there is no runtime JIT.
The practical win is not that any individual kernel is faster — many of the hand-rolled AVX-512 paths were already optimal — but that the matrix of (operator × precision × backend) now composes correctly. Before the rewrite, the INT4 Q4_K_M quantisation scheme had excellent AVX-512 performance, good Metal performance, and middling CUDA performance; the rewrite closes that gap because the same descriptor drives all three emissions.
KV cache rearrangement
The KV cache has been reorganised from a layer-interleaved layout to a head-contiguous layout. The practical implication: attention reads are now coalesced across the head dimension, which is the natural access pattern for the dot-product attention kernel. On long-context inference (8K+ tokens), the old layout was spending a meaningful fraction of time on uncoalesced reads on both Metal and CUDA; the new layout removes that bottleneck.
The rearrangement also allows a small but useful set of new optimisations. KV cache eviction for rolling-context inference (the "sliding window" pattern used for chat servers running indefinite sessions) can now drop entire head-contiguous slabs rather than having to rewrite the layer-interleaved layout in place. The 2.1x throughput gain on 70B quantised models comes disproportionately from this change; short-sequence inference on 7B models sees a smaller effect because the cache layout matters less when the cache is small.
Backend unification
Before the rewrite, the Metal and CUDA backends were two parallel trees of code. A new operator had to be implemented twice; a precision change had to be plumbed through both. The unified backend introduces a small op-graph lowering pass — essentially, the model forward pass is described as a graph of operator nodes, and each backend consumes the same graph. The lowering is done in plain C, not in a dependency-heavy IR framework, which is consistent with the project's philosophy.
The secondary benefit is that new backends become easier to add. A community ROCm backend has been in progress for several releases; the unified dispatch shortens the path to upstream inclusion meaningfully. Vulkan support is also expected to benefit, though as of the April 2026 release it remains gated behind a feature flag.
New API surface
The new API is narrower and explicitly non-allocating at the hot path. Most of the previous public struct fields are now opaque; allocation is hoisted to session creation. The canonical inference loop now looks close to this:
```cpp
#include "llama.h"

// Default parameter structs; tune fields as needed.
llama_model_params   mparams = llama_model_default_params();
llama_context_params cparams = llama_context_default_params();

llama_model*   model = llama_model_load_from_file("llama-3.3-70b.Q4_K_M.gguf", mparams);
llama_context* ctx   = llama_new_context_with_model(model, cparams);

// One sequence, n_tokens slots, no embedding inputs.
llama_batch batch = llama_batch_init(n_tokens, 0, 1);
for (int i = 0; i < n_tokens; i++) {
    llama_batch_add(&batch, tokens[i], i, { 0 }, false);  // sequence 0
}
batch.logits[n_tokens - 1] = true;  // logits for the final position only

if (llama_decode(ctx, batch) != 0) {
    return -1;
}

const float* logits = llama_get_logits_ith(ctx, n_tokens - 1);
// `sampler` is a llama_sampler* constructed separately via the sampler API.
llama_token next = llama_sampler_sample(sampler, ctx, n_tokens - 1);

llama_batch_free(batch);
llama_free(ctx);
llama_model_free(model);
```
The shape is familiar, but the internals are different: llama_decode now dispatches through the unified op-graph, llama_batch is cache-layout-aware, and the sampler API has been split from the context API so samplers can be composed independently. Code built against the 0.x API will not compile without edits, but the migration is mostly mechanical.
Measured speedups
| Model | Quantisation | Hardware | Old (tok/s) | New (tok/s) | Speedup |
|---|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | M3 Ultra | 14.1 | 29.6 | 2.10x |
| Llama 3.3 70B | Q4_K_M | H100 80GB | 68.4 | 132.7 | 1.94x |
| Llama 3.1 8B | Q4_K_M | M3 Ultra | 61.2 | 84.9 | 1.39x |
| Mistral 7B | Q4_K_M | H100 80GB | 118.2 | 165.1 | 1.40x |
| Llama 3.3 70B | Q8_0 | H100 80GB | 41.7 | 72.3 | 1.73x |
Single-request decode throughput, 2K prompt, 512-token generation, batch size 1.
What has not changed
GGUF remains the canonical model format and is backward-compatible; no re-quantisation is required to benefit from the rewrite. The project's build story remains a single make invocation with no external dependencies beyond a C compiler and the platform GPU SDK. The contributor bar — small, well-reasoned patches, clear performance numbers — has not moved. The rewrite is a structural refactor rather than a rebrand.
For teams running llama.cpp in production, the practical advice is straightforward: upgrade, re-benchmark, and budget a half-day for API migration if you are consuming the library rather than the binary. The gains are real, and the surface area has actually shrunk.
Further reading
- The Hugging Face ecosystem: what changed in 2026 — for the adjacent ecosystem context.
- Rust in production ML pipelines: 2026 adoption trends — the alternative inference-engine conversation.