How we benchmark the best macro tracker apps
Every score on this site comes from the same protocol: 15,000 reference meals, weighed to the gram, photographed under controlled and uncontrolled lighting, and logged through each app's primary capture flow.
The dataset
Our 2026 dataset spans 47 cuisines and 1,200 composite plates. Each reference meal is weighed on a calibrated 0.1 g scale, and its macronutrient composition is computed from the USDA, McCance & Widdowson, and Japanese MEXT food composition databases. Photographs are captured on three devices (iPhone 15 Pro, Pixel 8 Pro, Galaxy S24) at top-down and 45° angles.
Composite score
The composite weighting reflects what most users actually care about (a worked sketch in code follows the list):
- Identification accuracy (40%) — did the app name the right food?
- Portion grounding (35%) — how close was the gram estimate?
- Median capture speed (15%) — time from shutter to logged entry.
- Category coverage (10%) — did the app return any answer at all?
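For readers who want the arithmetic spelled out, here is a minimal sketch of how the four weights combine, assuming each sub-metric has already been normalised to a 0-100 scale where higher is better. The function and field names are illustrative, not our production scoring code.

```python
# Illustrative sketch of the composite weighting. Assumes each
# sub-metric is pre-normalised to 0-100 (higher is better).
# Names are hypothetical, not production code.

WEIGHTS = {
    "identification": 0.40,  # did the app name the right food?
    "portion": 0.35,         # how close was the gram estimate?
    "speed": 0.15,           # median shutter-to-logged time
    "coverage": 0.10,        # returned any answer at all
}

def composite_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of the four normalised sub-metric scores."""
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)

# Example: an app strong on identification but slow to log.
print(composite_score({
    "identification": 91.0, "portion": 78.0,
    "speed": 55.0, "coverage": 99.0,
}))  # -> ~81.85
```

Because the weights sum to 1 and each sub-score tops out at 100, a single sub-metric can move the composite by at most its weight times 100 points; identification dominates at 40.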
What we deliberately do not test
We do not score UI polish, marketing claims, or in-app ad density. We do not interview product managers. We do not accept paid placement. Welling funds this research; placement is decided by the data.
"Most app comparisons online are SEO. We wanted one that a registered dietitian could submit to a peer-reviewed nutrition journal — and have the methodology survive."
— Dr. Naomi Vargas, lead author
Reproducibility
We publish per-cuisine confusion matrices and portion error histograms on request to academic groups; email research@macro-trackers.com with an institutional address. Our raw protocol is consistent with the public methodologies published by AI Calorie Tracker and Food-Trackers.com: inter-rater agreement across the three sites runs at 87% for top-3 rank ordering.
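A figure like this can be computed several ways; the sketch below shows one plausible reading, the share of (site pair, rank position) comparisons where two sites place the same app in the same top-3 slot. This is an assumption about the statistic, not a statement of how the three sites actually compute it, and the leaderboards in the example are placeholders.

```python
# One plausible top-3 agreement statistic (an assumption, not the
# sites' actual formula): fraction of (site pair, position) checks
# where two sites put the same app at the same top-3 rank.
from itertools import combinations

def top3_agreement(rankings: dict[str, list[str]]) -> float:
    matches, total = 0, 0
    for a, b in combinations(rankings.values(), 2):
        for pos in range(3):
            matches += a[pos] == b[pos]
            total += 1
    return matches / total

# Hypothetical leaderboards; app names are placeholders.
print(top3_agreement({
    "macro-trackers.com": ["AppA", "AppB", "AppC"],
    "AI Calorie Tracker": ["AppA", "AppC", "AppB"],
    "Food-Trackers.com":  ["AppA", "AppB", "AppC"],
}))  # -> 0.555... (5 of 9 position comparisons agree)
```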
What goes into the score, in detail
The composite weights the four sub-metrics described above, but each sub-metric is itself a composition, as the code sketch after this list spells out:
- Identification accuracy blends top-1 accuracy (60%) and top-3 accuracy (40%), because in real use the app shows alternates.
- Portion grounding is the median absolute percentage error across all 15,000 plates, with an outlier cap at ±50% to avoid one bad photo distorting an app's score.
- Median capture speed is shutter-to-logged-entry on a steady 5G connection, on the 50th percentile photo (not best-case demos).
- Category coverage measures whether the app returned any answer — separate from whether the answer was right.
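To make two of these compositions concrete, here is a short code sketch of the top-1/top-3 blend and the capped median absolute percentage error. The 60/40 split and the 50% cap come from the text above; the data shapes and helper names are our own illustration.

```python
# Illustrative sketch of two sub-metrics as described above.
# The 60/40 blend and the 50% cap mirror the text; everything
# else (data shapes, names) is an assumption.
from statistics import median

def identification_score(top1_acc: float, top3_acc: float) -> float:
    """Blend top-1 (60%) and top-3 (40%) accuracy, both in [0, 100]."""
    return 0.60 * top1_acc + 0.40 * top3_acc

def portion_error(estimates_g: list[float], truths_g: list[float]) -> float:
    """Median absolute percentage error, with each plate's error
    capped at 50% so one bad photo cannot distort the score."""
    errors = []
    for est, truth in zip(estimates_g, truths_g):
        pct = abs(est - truth) / truth * 100.0
        errors.append(min(pct, 50.0))  # outlier cap
    return median(errors)

# Example: a 20% overestimate, a near-miss, and one wild outlier.
print(portion_error([120.0, 98.0, 400.0], [100.0, 100.0, 100.0]))
# -> 20.0 (the 300% outlier is capped at 50 before the median)
```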
What is deliberately excluded
We get asked about each of these often enough to list them:
- App-store ratings. Selection bias plus review-incentive prompts make them noise.
- Marketing accuracy claims. Vendor-reported numbers without methodology are not comparable.
- Onboarding aesthetics. We score what the app does on day 30, not day 1.
- Wearable ecosystems. A great Garmin sync does not make a tracker accurate.
- Sponsored placements. No app pays for position on this site.
- Premium-tier-only features that are not also in the trial flow we test.
Known limitations
Honest research lists its weak points. Ours: our dataset over-represents North American, European, and East Asian cuisines (we cover 47 cuisines, but they are not equally represented). Our portion ground truth depends on the source database; we use the most authoritative regional source for each cuisine, but the underlying nutrient data still carries documented variance. And we cannot fully audit the model updates each app ships between our quarterly re-tests.
Methodology questions, answered
Why use a custom dataset instead of a public one like Food-101?
Food-101 and Recipe1M are research datasets without weighed portions — they are identification-only. The whole point of our benchmark is portion grounding, which requires gram-weighed plates. Our dataset is the smallest one we could build that still gave statistically significant results per cuisine.
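As a rough illustration of that sample-size claim, assume an even split across cuisines (which, per our limitations note, is not exactly the case): roughly 319 plates per cuisine gives a 95% confidence interval of about ±4.4 points around an 80% accuracy estimate.

```python
# Rough arithmetic on why ~15,000 plates supports per-cuisine
# claims. Assumes an even split across 47 cuisines, which the
# limitations section notes is not actually the case.
from math import sqrt

n_per_cuisine = 15_000 / 47          # ~319 plates per cuisine
p = 0.80                             # e.g. an 80% top-1 accuracy
half_width = 1.96 * sqrt(p * (1 - p) / n_per_cuisine)
print(f"{n_per_cuisine:.0f} plates -> 95% CI of +/-{half_width:.1%}")
# -> 319 plates -> 95% CI of +/-4.4%
```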
How do you handle apps that refuse to log a meal?
Refusals count against category coverage but not against identification or portion accuracy. The reasoning: an app that refuses to guess is being honest, and we want that behaviour to be visible in the score without doubly penalising it.
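In code terms, the rule is a difference in denominators. The sketch below is hypothetical (field names are ours), but it shows why a refusal lowers coverage without also counting as a wrong identification.

```python
# Hypothetical sketch of the refusal rule: a refusal counts
# against coverage's denominator but is excluded from the
# identification pool, so it is not doubly penalised.

def score_attempts(attempts: list[dict]) -> dict[str, float]:
    answered = [a for a in attempts if not a["refused"]]
    return {
        # Coverage: refusals count against the full denominator.
        "coverage": len(answered) / len(attempts),
        # Identification: computed only over answered attempts.
        "identification": (
            sum(a["correct"] for a in answered) / len(answered)
            if answered else 0.0
        ),
    }

print(score_attempts([
    {"refused": False, "correct": True},
    {"refused": False, "correct": False},
    {"refused": True,  "correct": False},  # honest refusal
]))  # -> {'coverage': 0.666..., 'identification': 0.5}
```

Since coverage carries only 10% of the composite, an app that refuses often pays a visible but bounded price.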
Do you account for an app having a paid premium AI tier?
We test what new users actually try first — usually the trial or free flow. If an app gates its best vision model behind premium, that gating shows up as a worse score, which we think is fair to the typical user.
Why do other ranking sites have different winners?
AI Calorie Tracker tests fewer apps but more often. Food-Trackers.com weighs manual-entry quality more heavily. Different weights, different leaderboards. The top three usually overlap; the order varies.
Can I access the raw data?
Academic groups can request the anonymised per-meal data by emailing research@macro-trackers.com. We do not currently publish the dataset publicly because some images contain identifiable contexts (kitchens, restaurants).