Why Accuracy Benchmarks Mislead: Variance, Sample Size, Methodology
A single-number accuracy claim is a statistical object that deserves more skepticism than it usually gets. A primer on the variance, sample-size, and methodology questions that determine whether a benchmark actually means anything.
A single-number accuracy claim is nearly useless without a sample size, a test-set description, and a variance estimate. This piece walks through the statistical machinery that should accompany benchmark reporting: Wilson confidence intervals for classification accuracy, sample-size requirements for tight claims, Simpson's paradox in subgroup analysis, and the growing problem of test-set contamination from pretraining data. The fixes are cheap and rarely applied.
"94.2% accuracy" is a number with less information in it than most readers assume. It is a point estimate drawn from a single test set of unstated size, under a metric definition that may or may not be the one the reader expected, with no variance information to calibrate how surprised the reader should be if a follow-up evaluation produces 92.8% or 95.6%. And yet it is the form in which most ML performance claims — in papers, in blog posts, in consumer marketing — continue to be reported.
This essay is about what is missing from the single-number claim, why the missing pieces matter, and what to add if you are the one publishing the claim. The arithmetic is elementary. The habits are not.
The binomial floor
Classification accuracy on a test set is a binomial random variable: each test example is either correct or not, and the reported accuracy is the sample proportion. That means it comes with a predictable variance — specifically, for accuracy p and sample size n, the standard error is approximately √(p(1-p)/n).
The 95% confidence interval on a 95.0% accuracy claim, as a function of test-set size, falls out of that formula:
| Test-set size | 95% CI (Wilson) | Half-width |
|---|---|---|
| 100 | 88.9% – 97.8% | ±4.4% |
| 500 | 92.8% – 96.6% | ±1.9% |
| 1,000 | 93.4% – 96.2% | ±1.4% |
| 1,800 | 93.9% – 95.9% | ±1.0% |
| 7,300 | 94.5% – 95.5% | ±0.5% |
| 30,000 | 94.75% – 95.25% | ±0.25% |
Figure 1. 95% Wilson confidence intervals for a point estimate of 95.0% accuracy at various test-set sizes. A "95% accuracy with ±0.5%" claim requires roughly 7,300 evaluated examples.
Two observations. First, the most common test-set size in production ML — somewhere between 500 and 2,000 held-out examples — produces a confidence interval of roughly ±1-2% on accuracies near 95%. That is in the same neighborhood as the differences between competing models in most benchmark races. Second, a claim like "95.0% ± 0.5%" is a claim about 7,000+ samples and should not be trusted from any evaluation that does not disclose a comparably large test set.
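The sample sizes behind these claims fall out of inverting the standard-error formula. A minimal sketch, using the normal approximation (so slightly looser than the Wilson intervals in the table):

```python
import math

def required_n(p, half_width, z=1.96):
    """Samples needed for a binomial CI of +/- half_width at ~95% confidence,
    via the normal approximation n = z^2 * p * (1 - p) / half_width^2."""
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

print(required_n(0.95, 0.01))    # 1825 samples for +/-1% around 95% accuracy
print(required_n(0.95, 0.005))   # 7300 samples for +/-0.5%
```

Note that the required n scales with the inverse square of the half-width: halving the error bar quadruples the test set.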
A worked example. Suppose a vendor claims ±1.2% error on 612 blinded trials. The standard error of a 1.2% error rate at n=612 is approximately √(0.012 × 0.988 / 612) = 0.44%, which gives a 95% CI of approximately 0.3% to 2.1%. The correctly reported claim is "1.2% ± 0.9 percentage points at 95% confidence," not "1.2%." The point estimate is not wrong; it is just insufficient on its own.
```python
import numpy as np
from scipy.stats import norm


def wilson_ci(successes, n, conf=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = norm.ppf(1 - (1 - conf) / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Example: vendor reports ~1.2% error across 612 trials
# -> 612 trials, 604 "within threshold" outcomes
lo, hi = wilson_ci(successes=604, n=612)
print(f"Within-threshold rate: {604/612:.3%}")
print(f"95% CI: {lo:.3%} - {hi:.3%}")


# Bootstrap variant (preferred for non-binomial metrics)
def bootstrap_ci(errors, n_boot=10_000, conf=0.95):
    """Bootstrap CI for a mean of continuous errors (e.g., percent calorie error)."""
    boot = np.random.choice(errors, size=(n_boot, len(errors)), replace=True)
    means = boot.mean(axis=1)
    lo = np.quantile(means, (1 - conf) / 2)
    hi = np.quantile(means, 1 - (1 - conf) / 2)
    return lo, hi
```

The sample-size asymmetry
Benchmarks have a structural asymmetry: a strong-looking accuracy is easier to obtain on a small test set than on a large one. A model that gets lucky on 50 test examples posts an inflated point estimate whose wide confidence interval goes unreported; a model evaluated on 50,000 examples posts an honest point estimate with a tight interval. Readers process the two point estimates identically. Vendors, particularly consumer vendors, therefore have an incentive to keep test sets small and quote the point estimate to two decimals.
The fix is to report sample size prominently, not in a footnote. A helpful rule: any point estimate should be formatted as "N/D = X%" where N is the number of correct predictions, D is the test set size, and X is the percentage. The raw fraction makes the sample size unavoidable.
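The rule fits in one function; `report` is a hypothetical helper name, not an established API:

```python
def report(correct, total):
    """Format an accuracy claim so the sample size is unavoidable."""
    return f"{correct}/{total} = {correct / total:.1%}"

print(report(942, 1000))  # 942/1000 = 94.2%
```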
Simpson's paradox and subgroup drift
A model that looks better overall can be worse on every subgroup. The canonical illustration is admissions data (Bickel, Hammel, and O'Connell, Science 1975), but the effect appears in ML evaluations whenever subgroup distribution shifts between methods.
A concrete worked example. Suppose two food-recognition models, A and B, are tested on a set of 2,000 images split between "common dishes" (1,500 images) and "rare cuisine" (500 images).
- Model A achieves 92% on common (1,380/1,500) and 60% on rare (300/500). Overall: 84.0% (1,680/2,000).
- Model B achieves 93% on common (1,395/1,500) and 62% on rare (310/500). Overall: 85.25% (1,705/2,000).
Model B wins on every subgroup. Now suppose a competing evaluation uses a different split: 500 common, 1,500 rare. The overall accuracies become 68.0% for A and 69.75% for B, and the headline accuracy number moves by roughly 16 percentage points (84.0% to 68.0% for Model A) despite identical underlying model quality. Any "accuracy" number reported without a subgroup breakdown is vulnerable to this kind of drift.
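The flip is pure arithmetic over the subgroup mix; a quick check of the numbers above:

```python
def overall_accuracy(subgroup_acc, subgroup_n):
    """Overall accuracy as the count-weighted mean of subgroup accuracies."""
    total = sum(subgroup_n)
    return sum(a * n for a, n in zip(subgroup_acc, subgroup_n)) / total

model_a = [0.92, 0.60]  # accuracy on common dishes, rare cuisine
model_b = [0.93, 0.62]

print(round(overall_accuracy(model_a, [1500, 500]), 4))  # 0.84
print(round(overall_accuracy(model_b, [1500, 500]), 4))  # 0.8525
print(round(overall_accuracy(model_a, [500, 1500]), 4))  # 0.68
print(round(overall_accuracy(model_b, [500, 1500]), 4))  # 0.6975
```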
The practical recommendation is to report per-subgroup accuracy and per-subgroup sample sizes alongside the overall number. Most published papers do this. Most consumer product claims do not.
Test-set contamination
A 2025-specific concern: public benchmarks have been absorbed into the pretraining corpora of large foundation models. ImageNet, COCO, GLUE, MMLU, HumanEval — all appear, in whole or in part, in crawls that have fed into recent model training. A foundation model that scores well on a contaminated benchmark may be recalling rather than reasoning.
The 2024 GSM-Symbolic paper (Mirzadeh et al., Apple) demonstrated this on grade-school math: reworded but equivalent problems produce measurably lower scores than the original problems, with the gap correlated to the model's original training exposure. The analogous study in vision is not yet published at the time of writing, but preliminary indications (Roberts et al., 2024 workshop) suggest similar effects.
The defense is internal held-out evaluation. A proprietary test set, collected after the model's training cutoff, cannot have leaked into training. This is an argument for building an internal evaluation suite as a first-class engineering artifact rather than relying on public benchmarks as the ground truth.
Methodology disclosures that matter
A benchmark claim becomes useful when it carries five pieces of context:
- Test-set description. Where did the examples come from? How were they sampled? Is the distribution representative of the deployment target?
- Sample size. The raw count, not just the percentage.
- Metric definition. "Accuracy" for classification; for more complex metrics, the exact formula. A "calorie accuracy" claim is ambiguous without knowing whether it means MAPE, MAE, or RMSE.
- Variance estimate. Wilson CI for binomial metrics; bootstrap CI for continuous metrics. At minimum, a point estimate plus a confidence interval.
- Seed and hyperparameter disclosure. For training-time randomness, at least two seeds, and the hyperparameter search that selected the reported configuration.
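The metric-definition point is easy to see concretely. With hypothetical signed per-item errors (percent calorie error on eight items, invented for illustration), three reasonable-sounding summaries give very different headline numbers:

```python
import numpy as np

# Hypothetical signed errors, as percent of true calories, for eight items
pct_err = np.array([5.0, -3.0, 12.0, -8.0, 2.0, -20.0, 4.0, 1.0])

mape = np.mean(np.abs(pct_err))        # mean absolute percentage error
bias = abs(np.mean(pct_err))           # |mean signed error|: over/under cancels
rms = np.sqrt(np.mean(pct_err**2))     # RMS percent error: outlier-dominated

print(f"MAPE {mape:.2f}%, signed-mean {bias:.2f}%, RMS {rms:.2f}%")
# MAPE 6.88%, signed-mean 0.88%, RMS 9.10%
```

All three describe the same evaluation, and any of them could plausibly be marketed as "calorie accuracy within X%."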
A claim that includes all five is not necessarily true, but it is falsifiable. A claim missing any of the five is unfalsifiable in the Popper sense — it cannot be checked without additional information that the author did not provide.
The reproducibility-crisis comparison
The social-science reproducibility crisis is often invoked as an analogy for ML, and the analogy holds in some respects and breaks in others. The shared pathology is publication bias toward positive results and single-evaluation reporting. Two differences stand out: ML evaluation is (in principle) deterministic once seeds are fixed, so seed variance can be eliminated rather than merely characterized; and ML benchmarks are reused across hundreds of papers, which makes test-set overfitting a community-scale rather than a paper-scale problem.
The ICLR reproducibility track and the MLRC (Machine Learning Reproducibility Challenge) have quantified the scale of the issue across several years of conference submissions. The typical finding is that a substantial minority — often 20-30% — of claimed results do not reproduce exactly from the released code and data, though the qualitative conclusions usually do.
Worked example: reporting the same result three ways
Consider a team that has built a classifier, tested it on 1,000 examples, and gotten 942 correct. Three ways to report the same underlying observation:
| Report style | Example text | Falsifiable? |
|---|---|---|
| Marketing | "Over 94% accuracy in independent testing." | No — "over 94%" is a one-sided claim without test set details. |
| Minimum viable | "94.2% accuracy (942/1,000) on our internal test set." | Partially — sample size present, test-set provenance unclear. |
| Publishable | "94.2% top-1 accuracy (942/1,000, 95% Wilson CI 92.6%-95.5%) on our internal v2 test set, sampled from post-training-cutoff user submissions (n=1,000, collected Q2 2025). Per-cuisine subgroup accuracy: see Table 3." | Yes. |
Figure 2. Three report styles for the same underlying 942/1,000 result. The extra information in the publishable row takes one paragraph and is the difference between a falsifiable claim and a marketing claim.
Updated 2026
The recommendations in this piece have aged well. The 2026 addition is that major conferences (NeurIPS, ICML, ICLR) now treat Wilson or bootstrap confidence intervals as a default-expected element of accuracy tables, which has raised the floor of what reviewers will accept. Consumer AI marketing has not yet caught up, and it remains the main audience for the methodology recommendations here.
Conclusion
A single accuracy number, without sample size, metric definition, test-set description, or variance estimate, is a claim that cannot be checked. Adding those elements is cheap (a day of one engineer's time) and converts an unfalsifiable claim into a scientific one. The machinery is not new, the statistics are elementary, and the main obstacle is cultural. The 2026 reader's best defense is to treat any accuracy number without its supporting context as a conjecture rather than a result.
Frequently asked questions
What does a single accuracy number actually tell me?
Very little, unless it is accompanied by a test set description, a sample size, and a variance estimate. Without those, the same number can reflect a difficult well-sampled evaluation or a cherry-picked easy slice.
How large does a test set need to be for a 1% accuracy claim to be meaningful?
For a 95% confidence interval narrower than ±1% on an accuracy estimate near 95%, you need roughly 1,800 samples. For ±0.5% you need about 7,300. Most casually reported benchmarks use far fewer.
What is Simpson's paradox and why should ML practitioners care?
A trend that holds in aggregated data can reverse when the data is grouped. In ML, a model that looks better overall can be worse on every subgroup — common in heavily imbalanced domains.
Are ML benchmarks subject to a reproducibility crisis?
Yes, though the failure modes differ from the social-science version. Seed-dependence, undocumented hyperparameter search, and test-set contamination are the principal issues.
Should every accuracy number come with a confidence interval?
Yes. Wilson or Clopper-Pearson for classification accuracy; bootstrap for more complex metrics. Reporting a point estimate without an interval is common and a bad norm.
What is the "test set leakage" problem in 2025?
Public benchmarks have found their way into foundation-model pretraining corpora. A model that "solves" a contaminated benchmark may be recalling rather than reasoning.
How should I read a vendor accuracy claim?
Demand sample size, test-set composition, public-versus-proprietary status, metric definition, and a variance estimate. Without these the claim is unfalsifiable.
What is the simplest methodology improvement a team can make?
Report accuracy with a Wilson 95% confidence interval, a test-set description, and at least two seeds. One engineer-day eliminates most casual overclaiming.