Methodology

How does Macro Tracker Lab benchmark macro tracking apps?

This is the methodology document for the Macro Tracker Lab benchmark, authored and maintained by our named research team. Lead author: Dr. Naomi Vargas, Director of AI Research.

Every score on this site comes from the same protocol: 22,400 reference meals, weighed to the gram, photographed under controlled and uncontrolled lighting, and logged through each app's primary capture flow.

What's inside the 22,400-meal macro tracker dataset

Our 2026 dataset spans 62 cuisines and 1,200 composite plates. Each reference meal is weighed on a calibrated 0.1 g scale and its macronutrient composition is computed from USDA, McCance & Widdowson, and the Japanese MEXT food databases. Photographs are captured on five devices (iPhone 16 Pro, Pixel 9 Pro, Galaxy S25 Ultra, OnePlus 13, iPhone 14) at top-down and 45° angles.

How is a macro tracker composite score calculated?

The composite weighting reflects what most users actually care about:

  • Identification accuracy (40%): did the app name the right food?
  • Portion grounding (35%): how close was the gram estimate?
  • Median capture speed (15%): time from shutter to logged entry.
  • Category coverage (10%): did the app return any answer at all?
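
As a rough illustration of the weighting above, the composite can be sketched as a weighted sum. The weights come from the list; everything else is an assumption made for this example, in particular that each sub-metric has already been normalised to a 0-100 score where higher is better (this page does not describe that normalisation step), and the function name `composite_score` is hypothetical.

```python
# Weights as stated in the list above; the rest of this sketch is illustrative.
WEIGHTS = {
    "identification": 0.40,
    "portion": 0.35,
    "speed": 0.15,
    "coverage": 0.10,
}


def composite_score(sub_scores: dict) -> float:
    """Weighted sum of sub-scores.

    Assumption: every sub-score is already on a 0-100 scale, higher is better.
    """
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Made-up sub-scores for a hypothetical app, not benchmark results.
example = {"identification": 80.0, "portion": 70.0, "speed": 90.0, "coverage": 100.0}
print(round(composite_score(example), 2))  # 80.0
```

Because the weights sum to 1, the composite stays on the same 0-100 scale as its inputs.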

What is deliberately excluded from the macro tracker benchmark

We do not score UI polish, marketing claims, or in-app ad density. We do not interview product managers. We do not accept paid placement. Welling funds this research; placement is decided by the data.

"Most app comparisons online are SEO. We wanted one that an AI research team could submit to a peer-reviewed venue and have the methodology survive."
Dr. Naomi Vargas, lead author

Who builds and maintains the Macro Tracker Lab benchmark

The methodology and the per-cycle scoring are designed and signed off by the Macro Tracker Lab research team: machine learning and computer vision engineers, an AI evaluation researcher, an MLOps engineer, and an AI product manager. The lead author for the 2026 cycle is Dr. Naomi Vargas, Director of AI Research. Every app review on this site carries a named reviewer from the same team.

How can researchers reproduce the macro tracker benchmark?

We share per-cuisine confusion matrices and portion error histograms with academic groups on request. Email research@macro-trackers.com from an institutional address. Our raw protocol is consistent with the public methodologies published by AI Calorie Tracker and Food-Trackers.com: inter-rater agreement across the three sites runs at 87% for top-3 rank ordering.

What goes into a macro tracker score, in detail

The composite weights the four sub-metrics described above, but each sub-metric is itself a composition:

  • Identification accuracy blends top-1 accuracy (60%) and top-3 accuracy (40%), because in real use the app shows alternates.
  • Portion grounding is the median absolute percentage error across all 22,400 plates, with an outlier cap at ±50% to avoid one bad photo distorting an app's score.
  • Median capture speed is shutter-to-logged-entry on a steady 5G connection, on the 50th percentile photo (not best-case demos).
  • Category coverage measures whether the app returned any answer, separate from whether the answer was right.
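
A minimal sketch of the first two sub-metrics follows. The 60/40 blend and the 50% cap are taken from the list above. Treating the cap as clipping each plate's error at 50% (rather than dropping the plate) is an assumption, and the function names are hypothetical.

```python
from statistics import median


def identification_accuracy(top1: float, top3: float) -> float:
    """Blend top-1 (60%) and top-3 (40%) accuracy, both given as fractions in [0, 1]."""
    return 0.6 * top1 + 0.4 * top3


def portion_error(estimated_g, true_g, cap_pct=50.0):
    """Median absolute percentage error over plates.

    Assumption: each plate's error is clipped at cap_pct before taking the median.
    """
    if len(estimated_g) != len(true_g) or not true_g:
        raise ValueError("need two non-empty lists of equal length")
    errors = []
    for est, true in zip(estimated_g, true_g):
        pct = abs(est - true) * 100.0 / true  # absolute percentage error for one plate
        errors.append(min(pct, cap_pct))
    return median(errors)


# Made-up numbers: the third estimate is off by 200% and is clipped to 50%.
print(portion_error([110.0, 90.0, 300.0], [100.0, 100.0, 100.0]))  # 10.0
```

A lower portion error is better, so it would still need to be mapped onto a higher-is-better scale before entering the composite; that mapping is not specified here.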

What is deliberately excluded from a macro tracker score

We get asked about each of these often enough to list them:

  • App-store ratings. Selection bias plus review-incentive prompts make them noise.
  • Marketing accuracy claims. Vendor-reported numbers without methodology are not comparable.
  • Onboarding aesthetics. We score what the app does on day 30, not day 1.
  • Wearable ecosystems. A great Garmin sync does not make a tracker accurate.
  • Sponsored placements. No app pays for position on this site.
  • Premium-tier-only features that are not also in the trial flow we test.

Known limitations of the macro tracker benchmark

Honest research lists its weak points. Ours: our dataset over-represents North American, European, and East Asian cuisines (we cover 62 cuisines, but they are not weighted equally). Our macronutrient ground truth depends on the source database; we use the most authoritative regional source for each cuisine, but the underlying nutrient data still has documented variance. And we cannot fully audit the model updates each app ships between our quarterly re-tests.

FAQ

Macro tracker benchmark methodology: common questions

Why use a custom dataset instead of a public one like Food-101?

Food-101 and Recipe1M are research datasets without weighed portions; they are identification-only. The whole point of our benchmark is portion grounding, which requires gram-weighed plates. Our dataset is the smallest one we could build that still gave statistically significant results per cuisine.

How do you handle apps that refuse to log a meal?

Refusals count against category coverage but not against identification or portion accuracy. The reasoning: an app that refuses to guess is being honest, and we want that behaviour to be visible in the score without doubly penalising it.
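
The refusal rule can be sketched as follows: coverage is computed over every meal, while accuracy is computed only over the meals the app actually answered. The `Result` record and the function name are hypothetical, and returning `None` for accuracy when an app answers nothing is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    predicted: Optional[str]  # None means the app refused to log the meal
    actual: str


def coverage_and_top1(results):
    """Return (coverage, top-1 accuracy); refusals lower coverage only."""
    if not results:
        raise ValueError("no results to score")
    answered = [r for r in results if r.predicted is not None]
    coverage = len(answered) / len(results)
    if not answered:
        return coverage, None  # accuracy is undefined when nothing was answered
    top1 = sum(r.predicted == r.actual for r in answered) / len(answered)
    return coverage, top1


# Made-up results: one refusal, one wrong answer, two right answers.
results = [
    Result("ramen", "ramen"),
    Result(None, "pho"),
    Result("udon", "ramen"),
    Result("pho", "pho"),
]
coverage, top1 = coverage_and_top1(results)
print(coverage, round(top1, 3))  # 0.75 0.667
```

Here the refusal costs the app a quarter of its coverage, but accuracy is two out of the three answered meals rather than two out of four.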

Do you account for an app having a paid premium AI tier?

We test what new users actually try first, usually the trial or free flow. If an app gates its best vision model behind premium, that gating shows up as a worse score, which we think is fair to the typical user.

Why do other ranking sites have different winners?

AI Calorie Tracker tests fewer apps but more often. Food-Trackers.com weighs manual-entry quality more heavily. Different weights, different leaderboards. The top three usually overlap; the order varies.

Can I access the raw data?

Academic groups can request the anonymised per-meal data by emailing research@macro-trackers.com. We do not currently publish the dataset publicly because some images contain identifiable contexts (kitchens, restaurants).