Voice logging is quietly catching up to photos
Two of the top five apps now log faster from a 4-second voice memo than from a photo. The accuracy gap is closing.
For two years the assumption in this category has been that the camera wins. Photos are passive, fast, and sidestep the user's own portion blindness: say "a handful of rice" out loud and you'll lowball by a third.
That story now has a wrinkle. In our latest UX timings, voice logging is faster than photo logging on two of the top five apps. Median voice-to-logged-entry time on Welling is 3.9 seconds, faster than its own photo flow.
Why voice is closing the gap
Three things changed:
- Reasoning models that can clarify in-context. “About 200 grams of chicken” no longer needs a follow-up modal. The model writes the entry and asks one clarifying question only if it needs to.
- Personal priors. If the app has seen you log oats three mornings in a row, “the usual” is now a valid input. Photo flows do not benefit from this as much.
- Multimodal fusion. Some apps now accept a voice note while you photograph the plate. It turns out "and there's a side of kimchi" disambiguates the photo better than any vision model can on its own.
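The "personal priors" point above can be sketched as a simple frequency lookup over recent logs. This is a minimal illustration, not any app's actual implementation; the function name `resolve_usual`, the entry shape, and the thresholds are all assumptions:

```python
from collections import Counter
from datetime import date, timedelta

def resolve_usual(history, meal_slot, window_days=14, min_count=3):
    """Resolve a vague input like "the usual" against recent logs.

    Returns the most frequently logged food for this meal slot within
    the recent window, or None if nothing clears the repetition
    threshold (in which case the app would fall back to asking).

    history: list of dicts like {"food": ..., "slot": ..., "date": ...}
    """
    cutoff = date.today() - timedelta(days=window_days)
    recent = [e["food"] for e in history
              if e["slot"] == meal_slot and e["date"] >= cutoff]
    if not recent:
        return None
    food, count = Counter(recent).most_common(1)[0]
    return food if count >= min_count else None
```

Requiring a minimum repeat count mirrors the "oats three mornings in a row" example: the prior only fires once the habit is established, so a one-off meal never hijacks "the usual."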
Why photos still win on portion
Voice inherits the user’s portion estimation error. Photo logging removes it. Until voice models gain access to the same depth signals, the photo flow will be more accurate for hand-estimated portions.
The end state is probably both, with the user choosing per-meal based on context.