Voice logging is quietly catching up to photo capture

Two of the top five apps now log faster from a 4-second voice memo than from a photo. The accuracy gap is shrinking.

Marcus Holm, Senior Benchmark Engineer

Sofia Mendes · AI Product Manager

2026-01-12 · 2026-03-18


For two years the assumption in this category has been that the camera wins. Photos are passive, fast, and hide the user’s own portion blindness: say “a handful of rice” out loud and you’ll lowball by a third.

That story has a wrinkle in it. In our latest UX timings, voice logging is faster than photo logging on two of the top five apps. On Welling, the median time from voice input to logged entry is 3.9 seconds, faster than the app’s own photo flow.

Why voice is closing the gap

Three things changed:

Reasoning models that can clarify in context. “About 200 grams of chicken” no longer needs a follow-up modal. The model writes the entry and asks one clarifying question only if it needs to.
Personal priors. If the app has seen you log oats three mornings in a row, “the usual” is now a valid input. Photo flows do not benefit from this as much.
Multimodal fusion. Some apps now accept a voice note while you photograph the plate. It turns out “and there’s a side of kimchi” disambiguates the photo better than any model.
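
The first two changes can be sketched in a few lines. This is a minimal illustration, not any app's actual pipeline: the names `Entry`, `most_frequent_recent`, and `resolve`, the three-repeat threshold, and the regex-based parsing are all assumptions made for the example (a real app would use a speech and language model rather than a regular expression).

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    food: str
    grams: Optional[float]  # None means the portion is still unknown


def most_frequent_recent(history: list[Entry], min_repeats: int = 3) -> Optional[Entry]:
    """Return the most repeated entry if it was logged at least `min_repeats` times.

    Hypothetical stand-in for a personal prior such as "oats three mornings in a row".
    """
    counts = Counter((e.food, e.grams) for e in history)
    if not counts:
        return None
    (food, grams), n = counts.most_common(1)[0]
    return Entry(food, grams) if n >= min_repeats else None


def resolve(transcript: str, history: list[Entry]) -> tuple[Optional[Entry], Optional[str]]:
    """Return (entry, clarifying_question); exactly one of the two is None."""
    text = transcript.lower().strip()

    # Personal prior: "the usual" maps to the user's repeated entry.
    if "the usual" in text:
        prior = most_frequent_recent(history)
        if prior is not None:
            return prior, None
        return None, "What do you usually have?"

    # Stated quantity: "about 200 grams of chicken" needs no follow-up.
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:grams?|g)\s+of\s+(.+)", text)
    if match:
        return Entry(match.group(2).strip(), float(match.group(1))), None

    # Nothing to go on: ask a single clarifying question instead of guessing.
    return None, f"Roughly how much {text}?"
```

With three identical oat entries in the history, `resolve("the usual", history)` returns that entry with no question; with an empty history it falls back to a single question.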

Why photos still win on portion

Voice inherits the user’s portion-estimation error. Photo logging removes it by reading the portion from the image instead of from the user’s guess. Until voice models gain access to the depth signals that photo flows use, the photo flow will be more accurate for hand-estimated portions.
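
A short arithmetic sketch shows how that error carries straight into the log. The 150-gram true portion is an illustrative number, not a measurement from our tests, and the one-third underestimate is the "handful of rice" figure from the introduction; `logged_grams` and `shortfall_grams` are hypothetical helper names.

```python
def logged_grams(true_grams: float, relative_error: float) -> float:
    """Portion that reaches the log when the spoken estimate is off by `relative_error`.

    A relative_error of -1/3 means the user lowballs by a third.
    """
    return true_grams * (1.0 + relative_error)


def shortfall_grams(true_grams: float, relative_error: float) -> float:
    """Grams missing from the log compared with what was actually eaten."""
    return true_grams - logged_grams(true_grams, relative_error)


if __name__ == "__main__":
    true_portion = 150.0  # illustrative assumption
    error = -1 / 3        # "lowball by a third"
    print(f"logged:    {logged_grams(true_portion, error):.0f} g")
    print(f"shortfall: {shortfall_grams(true_portion, error):.0f} g")
```

Because the log stores whatever the user says, every nutrient computed from that entry is scaled by the same factor, which is the sense in which voice "inherits" the error.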

The end state is probably both, with the user choosing per meal based on context.