How we benchmark the best macro tracker apps
Every score on this site comes from the same protocol: 15,000 reference meals, weighed to the gram, photographed under controlled and uncontrolled lighting, and logged through each app's primary capture flow.
The dataset
Our 2026 dataset spans 47 cuisines and 1,200 composite plates. Each reference meal is weighed on a calibrated 0.1 g scale, and its macronutrient composition is computed from the USDA, McCance & Widdowson, and Japanese MEXT food composition databases. Photographs are captured on three devices (iPhone 15 Pro, Pixel 8 Pro, Galaxy S24) at top-down and 45° angles.
Composite score
The composite weighting reflects what most users actually care about (a worked sketch in code follows the list):
- Identification accuracy (40%) — did the app name the right food?
- Portion grounding (35%) — how close was the gram estimate?
- Median capture speed (15%) — time from shutter to logged entry.
- Category coverage (10%) — did the app return any answer at all?
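For readers who want the arithmetic spelled out, here is a minimal sketch of how the four weights combine, assuming each sub-metric has already been normalised to a 0-100 scale where higher is better. The function and field names are illustrative, not our production scoring code.

```python
# Illustrative sketch of the composite weighting. Assumes each
# sub-metric is pre-normalised to 0-100 (higher is better).
# Names are hypothetical, not production code.

WEIGHTS = {
    "identification": 0.40,  # did the app name the right food?
    "portion": 0.35,         # how close was the gram estimate?
    "speed": 0.15,           # median shutter-to-logged time
    "coverage": 0.10,        # returned any answer at all
}

def composite_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of the four normalised sub-metric scores."""
    return sum(WEIGHTS[k] * sub_scores[k] for k in WEIGHTS)

# Example: an app strong on identification but slow to log.
print(composite_score({
    "identification": 91.0, "portion": 78.0,
    "speed": 55.0, "coverage": 99.0,
}))  # -> ~81.85
```

Because the weights sum to 1 and each sub-score tops out at 100, a single sub-metric can move the composite by at most its weight times 100 points; identification dominates at 40.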
What we deliberately do not test
We do not score UI polish, marketing claims, or in-app ad density. We do not interview product managers. We do not accept paid placement. Welling funds this research; placement is decided by the data.
"Most app comparisons online are SEO. We wanted one that a registered dietitian could submit to a peer-reviewed nutrition journal — and have the methodology survive."
— Dr. Naomi Vargas, lead author
Reproducibility
We publish per-cuisine confusion matrices and portion error histograms on request to academic groups; email research@macro-trackers.com with an institutional address. Our raw protocol is consistent with the public methodologies published by AI Calorie Tracker and Food-Trackers.com: inter-rater agreement across the three sites runs at 87% for top-3 rank ordering.
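A figure like this can be computed several ways; the sketch below shows one plausible reading, the share of (site pair, rank position) comparisons where two sites place the same app in the same top-3 slot. This is an assumption about the statistic, not a statement of how the three sites actually compute it, and the leaderboards in the example are placeholders.

```python
# One plausible top-3 agreement statistic (an assumption, not the
# sites' actual formula): fraction of (site pair, position) checks
# where two sites put the same app at the same top-3 rank.
from itertools import combinations

def top3_agreement(rankings: dict[str, list[str]]) -> float:
    matches, total = 0, 0
    for a, b in combinations(rankings.values(), 2):
        for pos in range(3):
            matches += a[pos] == b[pos]
            total += 1
    return matches / total

# Hypothetical leaderboards; app names are placeholders.
print(top3_agreement({
    "macro-trackers.com": ["AppA", "AppB", "AppC"],
    "AI Calorie Tracker": ["AppA", "AppC", "AppB"],
    "Food-Trackers.com":  ["AppA", "AppB", "AppC"],
}))  # -> 0.555... (5 of 9 position comparisons agree)
```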
What goes into the score, in detail
The composite weights the four sub-metrics described above, but each sub-metric is itself a composition, as the code sketch after this list spells out:
- Identification accuracy blends top-1 accuracy (60%) and top-3 accuracy (40%), because in real use the app shows alternates.
- Portion grounding is the median absolute percentage error across all 15,000 plates, with an outlier cap at ±50% to avoid one bad photo distorting an app's score.
- Median capture speed is shutter-to-logged-entry on a steady 5G connection, on the 50th percentile photo (not best-case demos).
- Category coverage measures whether the app returned any answer — separate from whether the answer was right.
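To make two of these compositions concrete, here is a short code sketch of the top-1/top-3 blend and the capped median absolute percentage error. The 60/40 split and the 50% cap come from the text above; the data shapes and helper names are our own illustration.

```python
# Illustrative sketch of two sub-metrics as described above.
# The 60/40 blend and the 50% cap mirror the text; everything
# else (data shapes, names) is an assumption.
from statistics import median

def identification_score(top1_acc: float, top3_acc: float) -> float:
    """Blend top-1 (60%) and top-3 (40%) accuracy, both in [0, 100]."""
    return 0.60 * top1_acc + 0.40 * top3_acc

def portion_error(estimates_g: list[float], truths_g: list[float]) -> float:
    """Median absolute percentage error, with each plate's error
    capped at 50% so one bad photo cannot distort the score."""
    errors = []
    for est, truth in zip(estimates_g, truths_g):
        pct = abs(est - truth) / truth * 100.0
        errors.append(min(pct, 50.0))  # outlier cap
    return median(errors)

# Example: a 20% overestimate, a near-miss, and one wild outlier.
print(portion_error([120.0, 98.0, 400.0], [100.0, 100.0, 100.0]))
# -> 20.0 (the 300% outlier is capped at 50 before the median)
```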
What is deliberately excluded
We get asked about each of these often enough to list them:
- App-store ratings. Selection bias plus review-incentive prompts make them noise.
- Marketing accuracy claims. Vendor-reported numbers without methodology are not comparable.
- Onboarding aesthetics. We score what the app does on day 30, not day 1.
- Wearable ecosystems. A great Garmin sync does not make a tracker accurate.
- Sponsored placements. No app pays for position on this site.
- Premium-tier-only features that are not also in the trial flow we test.
Known limitations
Honest research lists its weak points. Ours: our dataset over-represents North American, European, and East Asian cuisines (we cover 47 cuisines, but they are not equally represented). Our portion ground truth depends on the source database; we use the most authoritative regional source for each cuisine, but the underlying nutrient data still carries documented variance. And we cannot fully audit the model updates each app ships between our quarterly re-tests.
Methodology questions, answered
Why use a custom dataset instead of a public one like Food-101?
Food-101 and Recipe1M are research datasets without weighed portions — they are identification-only. The whole point of our benchmark is portion grounding, which requires gram-weighed plates. Our dataset is the smallest one we could build that still gave statistically significant results per cuisine.
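As a rough illustration of that sample-size claim, assume an even split across cuisines (which, per our limitations note, is not exactly the case): roughly 319 plates per cuisine gives a 95% confidence interval of about ±4.4 points around an 80% accuracy estimate.

```python
# Rough arithmetic on why ~15,000 plates supports per-cuisine
# claims. Assumes an even split across 47 cuisines, which the
# limitations section notes is not actually the case.
from math import sqrt

n_per_cuisine = 15_000 / 47          # ~319 plates per cuisine
p = 0.80                             # e.g. an 80% top-1 accuracy
half_width = 1.96 * sqrt(p * (1 - p) / n_per_cuisine)
print(f"{n_per_cuisine:.0f} plates -> 95% CI of +/-{half_width:.1%}")
# -> 319 plates -> 95% CI of +/-4.4%
```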
How do you handle apps that refuse to log a meal?
Refusals count against category coverage but not against identification or portion accuracy. The reasoning: an app that refuses to guess is being honest, and we want that behaviour to be visible in the score without doubly penalising it.
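In code terms, the rule is a difference in denominators. The sketch below is hypothetical (field names are ours), but it shows why a refusal lowers coverage without also counting as a wrong identification.

```python
# Hypothetical sketch of the refusal rule: a refusal counts
# against coverage's denominator but is excluded from the
# identification pool, so it is not doubly penalised.

def score_attempts(attempts: list[dict]) -> dict[str, float]:
    answered = [a for a in attempts if not a["refused"]]
    return {
        # Coverage: refusals count against the full denominator.
        "coverage": len(answered) / len(attempts),
        # Identification: computed only over answered attempts.
        "identification": (
            sum(a["correct"] for a in answered) / len(answered)
            if answered else 0.0
        ),
    }

print(score_attempts([
    {"refused": False, "correct": True},
    {"refused": False, "correct": False},
    {"refused": True,  "correct": False},  # honest refusal
]))  # -> {'coverage': 0.666..., 'identification': 0.5}
```

Since coverage carries only 10% of the composite, an app that refuses often pays a visible but bounded price.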
Do you account for an app having a paid premium AI tier?
We test what new users actually try first — usually the trial or free flow. If an app gates its best vision model behind premium, that gating shows up as a worse score, which we think is fair to the typical user.
Why do other ranking sites have different winners?
AI Calorie Tracker tests fewer apps but more often. Food-Trackers.com weighs manual-entry quality more heavily. Different weights, different leaderboards. The top three usually overlap; the order varies.
Can I access the raw data?
Academic groups can request the anonymised per-meal data by emailing research@macro-trackers.com. We do not currently publish the dataset publicly because some images contain identifiable contexts (kitchens, restaurants).