ML Systems Review

Building Reliable Food Databases: USDA FoodData Central as Ground Truth

Every calorie estimate a consumer app shows you traces back to a nutrient database. The databases disagree, the source data is patchy, and the verification pipeline is a weeks-long process that most engineers underestimate.

Data Engineering
By Dr. Marcus Brennan, PhD. Reviewed by Dr. Theo Nakamura, PhD.
TL;DR

USDA FoodData Central is the closest thing to a canonical nutrient database in the United States. It is the ground truth most consumer food apps benchmark against — but no serious production database is USDA-only. In practice, apps merge FoodData Central with the Nutrition Coordinating Center's NCCDB, the proprietary Nutritionix catalog, and crowdsourced Open Food Facts entries, then run a multi-stage verification pipeline to reconcile disagreements. The engineering is harder than it sounds.

If you have ever wondered why two calorie-tracking apps can disagree about whether a banana has 89 or 105 kcal, the answer is almost never the app. The answer is the database. Every consumer food-tracking product (MyFitnessPal, Cronometer, Lose It!, Foodvisor, PlateLens, and a dozen others) sits on top of a nutrient database built from some combination of public, proprietary, and crowdsourced sources. The differences between those sources propagate downstream into every calorie estimate the app shows.

This piece is about how those databases are actually built: what USDA FoodData Central contains, how it is extended with NCCDB and Nutritionix, how Open Food Facts fits in, and what a production verification pipeline looks like in 2026. The goal is to make the data-engineering problem legible to engineers who have not had to solve it. It is a surprisingly rich subfield, and one that gets little written coverage relative to the volume of consumer-nutrition products being shipped.

What USDA FoodData Central actually is

USDA FoodData Central, launched in April 2019 and continuously updated since, is the U.S. Department of Agriculture's consolidated nutrient database. It combines five predecessor datasets: the Standard Reference (SR) database, the Foundation Foods database, the Food and Nutrient Database for Dietary Studies (FNDDS), the USDA Global Branded Food Products Database, and the Experimental Foods subset. As of early 2026, it contains roughly 1.9 million food items, though the entry count overstates the number of distinct foods: the large majority of entries (about 1.73 million, per Figure 1) are branded products with nutrient panels drawn from label data.

FoodData Central is free, public, and available as bulk CSV, JSON, and via a REST API with rate limits. The schema is documented in the FoodData Central API Guide (version 3.2, July 2024). Each food has a unique fdcId, a classification (foundation, branded, survey, etc.), and a nested structure of nutrients with values, units, and sometimes sample variance. The API supports keyword search, barcode lookup (for branded items), and category browsing.
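As a rough illustration of the keyword-search path, the sketch below builds a search request URL and pulls the energy value out of a response-shaped dictionary. The endpoint path, the `pageSize` parameter, the `foodNutrients` and `nutrientId` field names, and the use of nutrient id 1008 for energy in kcal reflect our reading of the public API and should be confirmed against the API guide. `sample_hit` is a hand-written placeholder, not real API output, and no network call is made.

```python
from typing import Optional
from urllib.parse import urlencode

# Search endpoint of the public FoodData Central API (assumed path; check the guide).
FDC_SEARCH_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"

# Assumption: nutrient id 1008 identifies "Energy" reported in kcal.
ENERGY_KCAL_ID = 1008


def build_search_url(query: str, api_key: str, page_size: int = 5) -> str:
    """Return a keyword-search URL; the caller is responsible for fetching it."""
    params = {"query": query, "pageSize": page_size, "api_key": api_key}
    return f"{FDC_SEARCH_URL}?{urlencode(params)}"


def energy_kcal(food: dict) -> Optional[float]:
    """Extract the energy value from one food record, or None if absent."""
    for nutrient in food.get("foodNutrients", []):
        if nutrient.get("nutrientId") == ENERGY_KCAL_ID:
            return nutrient.get("value")
    return None


# Hand-written stand-in for a single search hit; all values are placeholders.
sample_hit = {
    "fdcId": 0,
    "description": "Bananas, raw (illustrative)",
    "dataType": "Foundation",
    "foodNutrients": [
        {"nutrientId": 1003, "nutrientName": "Protein", "unitName": "G", "value": 1.0},
        {"nutrientId": 1008, "nutrientName": "Energy", "unitName": "KCAL", "value": 89.0},
    ],
}

url = build_search_url("banana", api_key="DEMO_KEY")
kcal = energy_kcal(sample_hit)
```

Keeping URL construction separate from response parsing makes both halves easy to unit-test without touching the rate-limited API.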

FoodData Central dataset composition (Feb 2026)

Foundation Foods ........ ~200 items (laboratory-analysed)
SR Legacy ............... ~7,800 items (reference composition)
FNDDS ................... ~9,000 items (survey, mixed dishes)
Branded Foods ........... 1,730,000 items (label-derived, noisy)
Experimental ............ ~60 items (research, provisional)
Total ................... ~1,900,000 items

Figure 1. USDA FoodData Central composition as of early 2026. Foundation Foods is the most authoritative subset; Branded Foods is the largest but noisiest.

The quality of FoodData Central is not uniform. The Foundation Foods subset — roughly 200 items directly chemically analysed by USDA labs — is the gold standard. Nutrient values are accompanied by sample counts and standard errors. The Branded Foods subset, which comprises almost all of the 1.9M items, is drawn from manufacturer-supplied label data and is only as accurate as the labels. An engineer who has worked on a consumer food app once described the Branded Foods subset to us as "a federally blessed OCR of the side of a cereal box."

Why USDA is not enough

If FoodData Central were sufficient on its own, every consumer food app would ship with a FoodData-only backend and we would not be writing this article. It is not sufficient, for three reasons.

First, the mixed-dish problem. Most people do not eat individual ingredients; they eat "chicken stir-fry with rice." FoodData Central's FNDDS subset has some of these as survey entries, but the coverage is sparse and the recipes are U.S.-centric. A consumer app that supports Mexican, Indian, Vietnamese, or West African cuisine will find FoodData's coverage inadequate and will need to supplement.

Second, the branded-product staleness problem. FoodData's branded-foods updates lag the manufacturer product catalog by months or years. A product reformulation — say, a cereal that changes its sugar content — propagates slowly into FoodData. A consumer who scans the current package and gets the old nutrient values is being given stale data.

Third, the international problem. FoodData Central is a U.S.-centric dataset. A consumer app serving users outside the U.S. needs a database with European, Asian, and Latin American branded products and regional foods. FoodData alone cannot provide this.

The four-source canon

In practice, production consumer food databases draw from four sources, each filling a different gap.

USDA FoodData Central

The backbone. Used for foundation ingredients, SR Legacy reference compositions, and a subset of branded products where FoodData's values have been verified against current labels. Public, free, relatively clean schema.

NCCDB (Nutrition Coordinating Center Database)

Produced by the University of Minnesota's Nutrition Coordinating Center and used heavily in research and some commercial products. NCCDB contains roughly 19,000 foods with detailed nutrient panels (over 160 nutrients per food) and structured recipe compositions. It is licensed, not free, and is stronger than FoodData on mixed dishes and restaurant foods. Cronometer's reputation for accuracy rests partly on its NCCDB licence. NCCDB is updated annually and audited by registered dietitians.

Nutritionix

Nutritionix is a proprietary, commercial food database with strong coverage of U.S. chain restaurants, packaged goods, and natural-language food queries. Its strength is the restaurant catalog: nutrient values for specific menu items at Chipotle, Starbucks, McDonald's, and several thousand smaller chains. Nutritionix is accessible via API under a commercial licence and is the default source for "chain restaurant food" in most consumer apps.

Open Food Facts

Open Food Facts is the Wikipedia of food products: a crowdsourced, open-data database of packaged goods contributed by users worldwide. As of early 2026 it contains roughly 3.2 million products, with particularly strong European coverage. Quality is variable. A verified entry with photos of the nutrition label and the ingredient list is essentially free ground truth; an unverified entry with a single user's typed-in values is unusable. Production pipelines use Open Food Facts only after a confidence filter.
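A confidence filter of this kind might look like the sketch below. The `CrowdEntry` record, its field names, the point values, and the cut-off of 7 are all hypothetical choices made for illustration; they are not Open Food Facts' schema or any app's actual policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrowdEntry:
    """Hypothetical, simplified view of a crowdsourced product record."""
    barcode: str
    kcal_per_100g: Optional[float]
    has_label_photo: bool        # photo of the nutrition panel attached
    has_ingredient_photo: bool   # photo of the ingredient list attached
    contributor_count: int       # distinct users who edited or confirmed it


def confidence_points(entry: CrowdEntry) -> int:
    """Score an entry from 0 to 10; the point values are illustrative."""
    if entry.kcal_per_100g is None:
        return 0
    # Pure fat is roughly 900 kcal per 100 g, so larger values are data errors.
    if not 0 <= entry.kcal_per_100g <= 900:
        return 0
    points = 0
    if entry.has_label_photo:
        points += 5
    if entry.has_ingredient_photo:
        points += 2
    if entry.contributor_count >= 2:
        points += 3
    return points


def passes_filter(entry: CrowdEntry, minimum: int = 7) -> bool:
    """Admit an entry into the pipeline only above the assumed cut-off."""
    return confidence_points(entry) >= minimum


# Placeholder entries: one photo-backed, one typed in by a single user.
verified = CrowdEntry("0000000000000", 379.0, True, True, 3)
typed_only = CrowdEntry("0000000000001", 379.0, False, False, 1)
implausible = CrowdEntry("0000000000002", 3790.0, True, True, 3)
```

The plausibility check runs before any points are awarded, so a photo-backed entry with an impossible energy value is still rejected.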

The verification pipeline

Building a consumer-grade food database out of these four sources is a data-engineering problem with real edges. A production pipeline typically runs something like the following stages.

  1. Ingestion and normalisation. Each source has a different schema, different units, different nutrient names. A normaliser maps everything to a canonical schema — a single nutrient ontology (typically derived from FoodData's), a single unit system (grams + kilocalories + milligrams as base units), a single serving-size representation (portion in grams, not "1 cup").
  2. Deduplication. The same product appears in multiple sources. A Chipotle burrito bowl is in Nutritionix, in user entries on Open Food Facts, sometimes in FoodData's branded subset. A deduplication pass — typically fuzzy-matched product name + brand + UPC — merges these into a canonical record.
  3. Reconciliation. When sources disagree on a nutrient value, the pipeline has to choose. Common policies: source ranking (Foundation Foods > NCCDB > Nutritionix > Branded FoodData > verified Open Food Facts > unverified Open Food Facts), consensus ranges (flag any value more than 2σ from the median of contributing sources), and human review for high-traffic items.
  4. Serving-size standardisation. The single hardest step in practice. "One banana" is a range from 80g to 150g. "A cup of rice" is 150-200g depending on cook and rice type. A database that conflates these produces per-serving calorie estimates that are wrong by 30-50%.
  5. Freshness monitoring. Product reformulations happen. A freshness-monitoring job re-checks high-traffic branded items against their current manufacturer label, flags discrepancies, and triggers human review.
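Stages 1 and 2 above can be sketched in a few lines. This is a minimal illustration, assuming records arrive as plain dictionaries with `name`, `brand`, and optional `upc` keys; those keys, the helper names, and the 0.85 similarity threshold are assumptions, and real pipelines use far more elaborate matching.

```python
import re
from difflib import SequenceMatcher

KJ_PER_KCAL = 4.184  # thermochemical calorie definition
TO_MILLIGRAMS = {"mg": 1.0, "g": 1000.0, "ug": 0.001, "µg": 0.001}


def to_kcal(value: float, unit: str) -> float:
    """Convert an energy value to kilocalories."""
    unit = unit.strip().lower()
    if unit == "kcal":
        return value
    if unit == "kj":
        return value / KJ_PER_KCAL
    raise ValueError(f"unsupported energy unit: {unit!r}")


def to_mg(value: float, unit: str) -> float:
    """Convert a mass value to milligrams."""
    return value * TO_MILLIGRAMS[unit.strip().lower()]


def normalise_name(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as one product if UPCs match, or brand matches
    exactly and the product names are sufficiently similar."""
    if a.get("upc") and a.get("upc") == b.get("upc"):
        return True
    if normalise_name(a["brand"]) != normalise_name(b["brand"]):
        return False
    similarity = SequenceMatcher(
        None, normalise_name(a["name"]), normalise_name(b["name"])
    ).ratio()
    return similarity >= threshold


# Placeholder records for illustration only.
rec_a = {"name": "Chicken Burrito Bowl", "brand": "Chipotle", "upc": None}
rec_b = {"name": "chicken burrito bowl!", "brand": "CHIPOTLE", "upc": None}
rec_c = {"name": "Chicken Burrito Bowl", "brand": "Other Brand", "upc": None}
```

Checking the UPC first keeps the fuzzy name comparison as a fallback for records, such as restaurant items, that have no barcode.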

# Simplified verification pipeline (pseudocode)
sources = [foundation, nccdb, nutritionix, branded, off_verified, off_unverified]
SOURCE_WEIGHT = {foundation: 1.0, nccdb: 0.95, nutritionix: 0.85,
                 branded: 0.7, off_verified: 0.5, off_unverified: 0.2}

for item in ingest_stream:
    canonical = normalise_schema(item)
    siblings = find_duplicates(canonical, window=30d)
    merged = weighted_merge(siblings, weights=SOURCE_WEIGHT)
    merged = flag_outliers(merged, z_threshold=2.0)
    merged = standardise_serving(merged, reference=foundation)
    if merged.confidence < 0.6:
        send_to_human_review(merged)
    else:
        upsert(merged)

Figure 2. Simplified verification pipeline used in production food databases. The real pipelines have more stages and substantially more logic around serving-size inference and brand disambiguation.
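To make the reconciliation step of Figure 2 concrete, here is a runnable sketch for a single nutrient. It reuses the source weights from the figure; everything else is an assumption: the `reconcile` function, measuring deviation from the median in units of the population standard deviation, and defining confidence as the highest weight among the surviving sources. The input values are illustrative.

```python
from statistics import median, pstdev
from typing import Dict, List, Tuple

SOURCE_WEIGHT = {
    "foundation": 1.0, "nccdb": 0.95, "nutritionix": 0.85,
    "branded": 0.7, "off_verified": 0.5, "off_unverified": 0.2,
}


def reconcile(
    values: Dict[str, float], z_threshold: float = 2.0
) -> Tuple[float, List[str], float]:
    """Merge one nutrient value reported by several sources.

    Returns (merged value, flagged sources, confidence).
    """
    centre = median(values.values())
    spread = pstdev(values.values())
    # Flag sources whose value sits more than z_threshold sigmas from the median.
    outliers = sorted(
        s for s, v in values.items()
        if spread > 0 and abs(v - centre) > z_threshold * spread
    )
    kept = {s: v for s, v in values.items() if s not in outliers}
    total_weight = sum(SOURCE_WEIGHT[s] for s in kept)
    merged = sum(SOURCE_WEIGHT[s] * v for s, v in kept.items()) / total_weight
    # Assumed definition: confidence is the best source still contributing.
    confidence = max(SOURCE_WEIGHT[s] for s in kept)
    return merged, outliers, confidence


def route(confidence: float, review_below: float = 0.6) -> str:
    """Mirror the final branch of the pseudocode."""
    return "human_review" if confidence < review_below else "upsert"


# Illustrative kcal per 100 g reported for the same item by five sources.
reported = {
    "foundation": 89.0, "nccdb": 89.0, "nutritionix": 90.0,
    "branded": 88.0, "off_unverified": 150.0,
}
merged, flagged, confidence = reconcile(reported)
decision = route(confidence)
```

With only a handful of sources per item, a 2σ rule flags just the grossest disagreements, which is one reason human review of high-traffic items remains part of the policy mix.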

How the consumer apps actually source their data

We attempted, in late 2025, to document the data sources behind the twelve largest consumer nutrition apps. Most do not publish this explicitly, and we relied on a combination of published documentation, interviews on background, and inference from the data itself. The picture is not uniform. Cronometer is heavily NCCDB-based with USDA supplementation and explicitly publishes this, which is part of why dietitians recommend it. MyFitnessPal is primarily crowdsourced — user-submitted entries, verified with manual moderation of high-traffic items — with USDA as a quality-control baseline, a model that produces broad coverage and notable long-tail accuracy variance. Lose It! appears to combine Nutritionix with a proprietary restaurant database. PlateLens's 1.2-million-entry database, per its engineering documentation, is built on USDA FoodData Central cross-referenced with NCCDB, with branded-product updates sourced from Nutritionix and Open Food Facts verified entries.

None of these approaches is inherently right or wrong. The tradeoffs are predictable: crowdsourced coverage is broad but noisy; licensed-database approaches are clean but expensive to maintain; USDA-only approaches are reliable for ingredients but thin for mixed dishes and restaurants. Any consumer app that does not acknowledge this tradeoff somewhere in its methodology documentation is, in our experience, underselling the complexity of the problem.

The serving-size problem is most of the problem

If we had to pick the single largest source of error in consumer calorie estimation, it would not be the database. It would be the serving-size inference. The database can tell you, accurately, that 100g of cooked rice is 130 kcal. The consumer app's job is to figure out how many grams of rice are on the plate. A "medium serving" estimate that is off by 20% produces a calorie estimate that is off by 20%, regardless of how clean the database is.
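The arithmetic is simple enough to check directly. In the sketch below the 130 kcal per 100 g figure comes from the paragraph above, while the 200 g true portion is an assumed value chosen for illustration.

```python
def kcal_for_portion(grams: float, kcal_per_100g: float) -> float:
    """Scale a per-100 g energy density to an actual portion."""
    return grams * kcal_per_100g / 100.0


KCAL_PER_100G_RICE = 130.0   # database value, taken as exact
true_grams = 200.0           # assumed true portion on the plate
estimated_grams = true_grams * 1.20  # portion estimate that is 20% too high

true_kcal = kcal_for_portion(true_grams, KCAL_PER_100G_RICE)
estimated_kcal = kcal_for_portion(estimated_grams, KCAL_PER_100G_RICE)

# Because energy is linear in mass, the relative error carries over unchanged.
relative_error = (estimated_kcal - true_kcal) / true_kcal
```

Because the calorie estimate is a product of portion mass and energy density, a relative error in either factor passes straight through to the result.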

This is why photo-based calorie trackers have invested heavily in depth estimation and portion-volume reconstruction — a problem we treated separately in our 2025 piece on depth estimation from single RGB images. The database is table stakes. The serving-size estimate is where the remaining error lives.

What we would change

If we were redesigning the consumer-nutrition data ecosystem from scratch, three changes stand out.

First, FoodData Central should publish uncertainty bounds more aggressively. The Foundation Foods subset does publish sample variance, but most entries do not, and most consumer apps silently treat FoodData values as point estimates. A "banana, medium" with a published 90% confidence interval of 85-110 kcal would be a better API response than a bare "105 kcal".
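Where a mean and a standard error are available, a consumer app could derive such an interval itself. The sketch below uses a normal approximation; the `confidence_interval` helper and the standard error of 6 kcal are hypothetical, and with the small sample counts typical of laboratory analyses a t-distribution would give a somewhat wider interval.

```python
from statistics import NormalDist
from typing import Tuple


def confidence_interval(
    mean: float, std_error: float, level: float = 0.90
) -> Tuple[float, float]:
    """Two-sided interval for the mean under a normal approximation."""
    # Critical value for the requested coverage (about 1.645 for 90%).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std_error, mean + z * std_error


# 105 kcal is the point estimate quoted above; the standard error is assumed.
low, high = confidence_interval(mean=105.0, std_error=6.0)
half_width = (high - low) / 2.0
```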

Second, NCCDB should be cheaper or more open. The licensing cost is a meaningful barrier to small-team consumer apps, and the quality advantage it offers is large enough that more users would benefit if the licensing model were closer to Open Food Facts.

Third, the industry should publish per-category accuracy benchmarks. It is currently impossible to compare two consumer food databases rigorously without running the benchmark yourself. A standardised evaluation suite — perhaps run by an academic group — would let consumer apps compete on database quality instead of marketing.
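One possible shape for such a benchmark is a per-category mean absolute percentage error against laboratory reference values. The metric choice, the `mape_by_category` helper, and the sample rows below are assumptions for illustration, not an existing industry standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, float, float]  # (category, reference kcal, database kcal)


def mape_by_category(rows: Iterable[Row]) -> Dict[str, float]:
    """Mean absolute percentage error of database values, per food category."""
    errors = defaultdict(list)
    for category, reference, reported in rows:
        if reference <= 0:
            continue  # relative error is undefined without a positive reference
        errors[category].append(abs(reported - reference) / reference)
    return {c: 100.0 * sum(e) / len(e) for c, e in errors.items()}


# Placeholder rows; a real suite would use laboratory-analysed references.
rows = [
    ("fruit", 100.0, 110.0),
    ("fruit", 200.0, 190.0),
    ("restaurant", 500.0, 650.0),
]
scores = mape_by_category(rows)
```

Reporting the score per category rather than as a single number keeps a database that is strong on raw ingredients from hiding weak restaurant or mixed-dish coverage.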

Corrections to food-database references — particularly for apps where we have inferred rather than confirmed the source mix — go to corrections@mlsystemsreview.com.