ML Systems Review

DeepSeek-V3.5 Paper: What's Actually Novel

DeepSeek's V3.5 release note claims substantial gains over V3 and V3.1. A careful reading separates the genuinely new contributions — routing changes, efficiency gains, RLHF tweaks — from the repackaged ones, and from the parts that amount to more training rather than new technique.

Model Architecture
By Dr. Marcus Brennan, PhD · Reviewed by Dr. Theo Nakamura, PhD

DeepSeek released V3.5 in early April 2026, accompanied by a 71-page technical report and an open weights drop under the same permissive license as V3. The marketing framing positions V3.5 as a major step beyond V3.1 — improved reasoning, better code synthesis, lower inference cost. A careful read separates the substantive architectural contributions from the parts that amount to additional training on top of the V3 family. The novel contributions are real, but narrower than the release note suggests.

V3.5 is a 685B-parameter mixture-of-experts model with 41B active parameters per token, a small increase over V3's 37B active. Training used the same 8K-H800 cluster configuration as V3 at an estimated 3.1M GPU-hours — roughly 1.4x V3's compute budget. Benchmark gains over V3.1 are most pronounced on code (+4.8 points on HumanEval+, +6.1 on LiveCodeBench), mathematics (+3.9 on MATH-500), and long-context retrieval, with more modest gains on general knowledge benchmarks.

What is genuinely new

Five contributions in the V3.5 paper are, in our reading, materially new relative to V3 and V3.1. Two are architectural, two are training-side, and one is a deployment artifact.

  • Auxiliary-loss-free load balancing with expert drift correction. V3 introduced the no-auxiliary-loss routing scheme that used a bias term to balance expert load. V3.5 extends this with an online expert drift correction: the per-expert bias is decomposed into a stable component and a short-horizon corrective component, which is reset on each evaluation interval. This reduces a specific failure mode observed in long V3 training runs, where a small number of experts would accumulate disproportionate gradient updates and drift into near-duplicates of each other. The correction is described in §3.2 of the report and is, to our knowledge, the first published treatment of this drift problem in auxiliary-loss-free MoE training.
  • Multi-token prediction with staggered-horizon heads. V3 shipped a single-token-lookahead MTP head, trained with a shared backbone. V3.5 extends the MTP formulation to three staggered heads predicting tokens at offsets +1, +2, and +4, with a loss schedule that weights the near-horizon head more heavily early in training and gradually admits the longer-horizon heads. The paper reports a 1.8x speculative-decoding acceptance rate under greedy verification, up from V3's 1.4x. The idea itself is not unprecedented (Medusa and EAGLE both explored multi-head speculative decoding), but the staggered-horizon training schedule is new.
  • FP8-native attention with per-block descaling. V3's attention was computed in BF16 with FP8 matrix multiplies for the projections. V3.5 moves the full attention dot-product to FP8, with a per-block descaling scheme that bounds the cumulative error within an empirically determined envelope. The paper is admirably honest about when this falls over (long sequences with pathological score distributions) and ships a fallback. The measured wall-clock speedup on the training cluster is 18% end-to-end.
  • Constitutional RLHF with a critic-of-critics layer. The RLHF stack adds a meta-critic that scores the primary critic's judgments against a small hand-curated reference set. This is a direct response to the V3.1 post-mortem, which acknowledged that the reward model exhibited systematic biases against verbose answers. The meta-critic approach is a recognisable lineage from Anthropic's constitutional AI work, but the implementation details — in particular, the reference-set curation protocol — are specific and reproducible.
  • A distilled 23B-active dense variant shipped alongside the MoE. DeepSeek released a distilled dense model in the same drop, which is new as a release practice if not as a research technique. The distillation targets inference-cost parity with Llama 3.3 70B while retaining most of V3.5's reasoning benchmark performance. Early third-party reproductions put the distilled model within 2.3 points of V3.5 on average across MMLU, GSM8K, and HumanEval.
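The drift-correction idea in the first bullet can be made concrete with a small sketch. This is our reading of §3.2, not DeepSeek's code: the function names, learning rates, and the sign-based update rule are all assumptions; the point is the two-component bias with a periodically reset fast term.

```python
import numpy as np

def update_router_bias(bias_stable, bias_corrective, expert_load, target_load,
                       lr_stable=0.001, lr_corrective=0.01):
    """One update of a two-component routing bias (hypothetical sketch).

    Overloaded experts get their bias pushed down so the router selects
    them less often. The corrective term reacts ~10x faster but is
    periodically reset, so it cannot accumulate into long-run drift.
    """
    err = target_load - expert_load          # positive => expert underloaded
    bias_stable = bias_stable + lr_stable * np.sign(err)
    bias_corrective = bias_corrective + lr_corrective * np.sign(err)
    return bias_stable, bias_corrective

# Toy run: 4 experts, expert 0 chronically overloaded.
n_experts = 4
stable = np.zeros(n_experts)
corrective = np.zeros(n_experts)
target = np.full(n_experts, 0.25)

for step in range(100):
    load = np.array([0.40, 0.20, 0.20, 0.20])   # observed routing fractions
    stable, corrective = update_router_bias(stable, corrective, load, target)
    if step % 50 == 49:                          # "evaluation interval"
        corrective[:] = 0.0                      # reset the short-horizon term

print(stable[0] < 0, corrective[0] == 0)  # → True True
```

The split matters because the stable term encodes the long-run correction while the resettable term absorbs transient load spikes; in the paper's framing it is the unbounded accumulation of exactly this kind of correction that produced near-duplicate experts in long V3 runs.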
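The staggered-horizon loss schedule in the second bullet can likewise be sketched. The paper describes only the qualitative behaviour (near-horizon head dominant early, longer horizons admitted gradually); the ramp start points, ramp width, and per-offset downweighting below are our assumptions.

```python
def mtp_head_weights(progress, horizons=(1, 2, 4)):
    """Per-head MTP loss weights as a function of training progress in [0, 1].

    The +1 head carries full weight from the start; each longer horizon
    ramps in later and is additionally downweighted by its offset. (The
    exact shape is illustrative, not from the paper.)
    """
    raw = []
    for i, h in enumerate(horizons):
        start = 0.2 * i                                   # farther heads start later
        ramp = 1.0 if i == 0 else min(1.0, max(0.0, (progress - start) / 0.3))
        raw.append(ramp / h)                              # downweight far horizons
    total = sum(raw)
    return [w / total for w in raw]

for p in (0.0, 0.3, 1.0):
    print(p, [round(w, 3) for w in mtp_head_weights(p)])
```

At progress 0 all loss mass sits on the +1 head; by the end of training the +2 and +4 heads contribute, but never outweigh the near-horizon head — consistent with using them primarily to improve speculative-decoding acceptance rather than next-token quality.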
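The per-block descaling in the third bullet can be demonstrated with a numpy simulation. This is not real FP8 arithmetic — we fake an E4M3-style format by rounding to 3 mantissa bits — but it shows the mechanism: each block is scaled into the FP8 representable range, quantized, then descaled, which bounds the error contributed by any one block.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8_blockwise(x, block=32):
    """Per-block scale + fake-FP8 quantize + descale (numpy simulation)."""
    orig_shape = x.shape
    xb = x.reshape(-1, block)
    # Per-block scale maps the block's max magnitude onto the FP8 range.
    scale = np.abs(xb).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scale = np.maximum(scale, 1e-12)
    q = xb / scale
    # Crude E4M3 emulation: round the mantissa to 3 bits.
    exp = np.floor(np.log2(np.maximum(np.abs(q), 1e-12)))
    step = 2.0 ** (exp - 3)
    q = np.round(q / step) * step
    return (q * scale).reshape(orig_shape)   # descale back to original range

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 64)).astype(np.float32)
b = rng.normal(size=(64, 64)).astype(np.float32)

ref = a @ b                                              # BF16/FP32-style reference
approx = quantize_fp8_blockwise(a) @ quantize_fp8_blockwise(b)
rel_err = np.abs(approx - ref).max() / np.abs(ref).max()
print(rel_err)
```

On well-behaved inputs like these the relative error stays small; the paper's caveat is precisely that attention score distributions on long sequences are not always well-behaved, which is why a fallback path ships alongside the FP8 kernel.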
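The critic-of-critics idea in the fourth bullet reduces to a simple audit loop. The function below is our sketch of such a protocol, not DeepSeek's implementation: the reference set is hand-labelled preference pairs, and we add a crude verbosity probe because the V3.1 post-mortem's reward-model bias was length-related.

```python
def meta_critic_audit(critic, reference_set):
    """Audit a critic against hand-labelled preference pairs (hypothetical sketch).

    reference_set: list of (prompt, preferred, dispreferred) triples.
    Returns (agreement rate with the human labels, fraction of pairs where
    the critic prefers the longer answer regardless of label).
    """
    agree = 0
    prefers_longer = 0
    for prompt, good, bad in reference_set:
        agree += int(critic(prompt, good) > critic(prompt, bad))
        short, long_ = sorted((good, bad), key=len)
        prefers_longer += int(critic(prompt, long_) > critic(prompt, short))
    n = len(reference_set)
    return agree / n, prefers_longer / n

# A deliberately length-biased critic should be flagged by the probe.
length_biased = lambda prompt, answer: float(len(answer))
refs = [
    ("q1", "concise correct answer", "a much longer answer the labellers rejected"),
    ("q2", "a long, thorough answer that the labellers preferred", "nope"),
]
accuracy, verbosity_rate = meta_critic_audit(length_biased, refs)
print(accuracy, verbosity_rate)  # → 0.5 1.0
```

A meta-critic built this way catches exactly the failure the V3.1 post-mortem described: a reward model that is right on average but systematically wrong along one axis, here answer length.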

What is repackaged

Several contributions framed as new in the release note are, on close reading, incremental tuning of V3.1. The multi-head latent attention scheme (MLA) is unchanged from V3. The YaRN-based long-context extension is unchanged. The tokenizer is unchanged. The 128K base context length and 1M extended context length were both already available in V3.1. The claim of "improved reasoning" on MMLU is a 1.1-point gain that is within the variance reported elsewhere in the paper.

A subtler point: the efficiency gains on inference cost attributed to V3.5 are roughly 60% architectural (FP8 attention, improved MTP) and 40% operational (better kernel fusion, updated vLLM integration, SGLang tuning). The operational gains are worth having but are not properties of the model itself.

Comparison to V3 and V3.1

Property                V3      V3.1    V3.5
Total params            671B    671B    685B
Active params           37B     37B     41B
Training GPU-hours      2.2M    2.6M    3.1M
MTP horizons            1       1       1, 2, 4 (staggered)
Attention precision     BF16    BF16    FP8 (block-descaled)
Speculative acceptance  1.4x    1.5x    1.8x
HumanEval+              78.2    81.4    86.2
MATH-500                87.1    89.0    92.9
MMLU                    88.5    89.1    90.2

V3 and V3.1 numbers from respective DeepSeek technical reports. V3.5 numbers from the April 2026 release.

Reservations

Two reservations. First, the FP8 attention story is load-bearing for the efficiency claims, and the paper's failure-mode discussion, while candid, treats the fallback as rare; third-party users running V3.5 on non-DeepSeek inference stacks should measure rather than assume. Second, the distilled dense model's benchmark parity with the MoE is strong but was evaluated on the same benchmark suite DeepSeek has used since V2. Independent evaluations on held-out benchmarks — particularly long-context and tool-use ones — are still incoming.

None of this undercuts the reading that V3.5 is a substantive release. The auxiliary-loss-free drift correction and the staggered MTP scheme are each the kind of contribution that other labs will cite in their own reports. V3.5's position in the open-weight frontier is, on our reading, unambiguous.
