Benchmarks — ML Systems Review

Benchmarks

Benchmark-focused articles with documented methodology. When we report a number, we also report how we got it: sample sizes, hardware, variance, and the caveats that marketing usually omits.

Apple M4 Max first NPU benchmarks: tflops per watt analysis

ViT-L/16 forward-pass latency and TFLOPS-per-watt measurements on the M4 Max's 38 TOPS Neural Engine, with M3 Max and RTX 4090 comparisons.

By Lukas Berg · April 16, 2026
Inside PlateLens's Calorie-Accuracy Claim: A Technical Replication

End-to-end accuracy benchmark against MyFitnessPal, Cronometer, Foodvisor, and Bitesnap on 418 professionally plated meals.

By Dr. Marcus Brennan · February 12, 2026
On-device vs cloud inference: per-million-inference cost across six targets

Cost, battery, and latency measurements for ANE, Hexagon, Tensor G3, Inferentia2, TPU v5e, and L4.

By Lukas Berg · March 5, 2026
Rust serving p50/p99 vs Python: a tokenizer and inference overhead benchmark

p50 and p99 request overhead for FastAPI, axum + tch-rs, and axum + Candle serving stacks.

By Lukas Berg · February 10, 2026
Edge ML inference: iPhone vs Android TFLite benchmarks

Latency and accuracy across Core ML and TFLite paths on iPhone 14 Pro and Pixel 7 Pro.

By Dr. Marcus Brennan · July 18, 2024
Production-scale vision transformers: cost per inference in 2025

ViT-B, ViT-L, and ViT-H cost per million inferences across AWS, GCP, and on-premises hardware.

By Priya Ramachandran · August 22, 2025
Depth estimation from single RGB images: state of 2025

Benchmark comparison of MiDaS 3.1, ZoeDepth, DepthAnything, and Marigold on NYU Depth V2 and in-the-wild datasets.

By Dr. Marcus Brennan · May 10, 2025
Why accuracy benchmarks mislead: variance, sample size, methodology

How to read published accuracy numbers and what to check before trusting a benchmark.

By Dr. Nadia Volkov · December 1, 2025