Voice logging is quietly catching up to photo logging

On two of the top five apps, logging a meal with a 4-second voice memo is already faster than taking a photo. The accuracy gap is narrowing.

Marcus Holm, Senior Benchmark Engineer

Sofia Mendes · AI Product Manager

2026-01-12 · 2026-03-18

For two years the assumption in this category has been that the camera wins. Photos are passive, fast, and hide the user’s own portion blindness: say “a handful of rice” out loud and you’ll lowball by a third.

That story has a wrinkle in it. In our latest UX timings, voice logging is faster than photo logging on two of the top five apps. Median voice-to-logged-entry on Welling is 3.9 seconds, faster than its own photo flow.

Why voice is closing the gap

Three things changed:

Reasoning models that can clarify in-context. “About 200 grams of chicken” no longer needs a follow-up modal. The model writes the entry and asks one clarifying question only if it needs to.
Personal priors. If the app has seen you log oats three mornings in a row, “the usual” is now a valid input. Photo flows do not benefit from this as much.
Multimodal fusion. Some apps now accept a voice note while you photograph the plate. It turns out “and there’s a side of kimchi” disambiguates the photo better than any model.
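
The personal-priors idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MealHistory` class, the `resolve_entry` function, the per-slot history, and the three-in-a-row rule are all hypothetical and do not describe how any app in the benchmark works. The sketch resolves “the usual” to a food only when the last three entries for that meal slot agree, and otherwise signals that a clarifying question is needed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MealHistory:
    """Hypothetical per-user log history, keyed by meal slot (e.g. "breakfast")."""
    entries: Dict[str, List[str]] = field(default_factory=dict)

    def log(self, slot: str, food: str) -> None:
        """Append a logged food to the given meal slot."""
        self.entries.setdefault(slot, []).append(food)

    def usual(self, slot: str, window: int = 3) -> Optional[str]:
        """Return the food if the last `window` entries in the slot all match."""
        recent = self.entries.get(slot, [])[-window:]
        if len(recent) == window and len(set(recent)) == 1:
            return recent[0]
        return None  # not enough history, or no consistent habit


def resolve_entry(utterance: str, slot: str, history: MealHistory) -> Optional[str]:
    """Map a spoken utterance to a food name.

    Returns None when "the usual" cannot be resolved, which a caller
    would treat as the cue to ask one clarifying question.
    """
    text = utterance.strip().lower()
    if text == "the usual":
        return history.usual(slot)
    return text


history = MealHistory()
for _ in range(3):
    history.log("breakfast", "oats")

resolved = resolve_entry("the usual", "breakfast", history)
```

After three matching breakfast logs, `resolved` holds `"oats"`. With only two logs, or with mixed foods, the function returns `None` and the app would fall back to asking.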

Why photos still win on portion

Voice inherits the user’s portion estimation error. Photo logging removes it. Until voice flows gain access to the depth signals a photo provides, the photo flow will be more accurate for hand-estimated portions.
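
A small worked example makes the inherited error concrete. It is a sketch with assumed numbers: the 150 g true portion is purely illustrative, the one-third lowball is the figure quoted earlier for spoken estimates, and treating the photo flow as having zero bias is an idealized reading of “removes it”, not a measured result.

```python
from fractions import Fraction


def logged_grams(true_grams: Fraction, lowball: Fraction) -> Fraction:
    """Amount that ends up in the log when the estimate undershoots by `lowball`."""
    return true_grams * (1 - lowball)


true_rice = Fraction(150)  # illustrative true portion in grams (assumption)

# Voice: the spoken estimate carries the user's one-third lowball.
voice_logged = logged_grams(true_rice, Fraction(1, 3))

# Photo: idealized as carrying no user estimation bias (assumption).
photo_logged = logged_grams(true_rice, Fraction(0))

# Grams missing from the voice entry relative to the photo entry.
shortfall = photo_logged - voice_logged
```

Under these assumptions the voice entry records 100 g against a true 150 g, a 50 g shortfall that the photo flow does not incur.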

The end state is probably both, with the user choosing per-meal based on context.