Why dictation failed

Dictation has been five years away for thirty years.

Every decade or so, someone announces that voice is about to replace the keyboard. It never does. The prediction isn't wrong exactly; it's just been waiting on two things to become true at the same time, and until recently, neither was.

For most of its history, dictation required you to meet the technology halfway. Speak slowly. Enunciate. Train the model on your voice for hours before it could handle basic sentences. Correct errors constantly. The effort of using dictation often exceeded the effort of just typing, which defeated the entire point.

The same large language model breakthroughs behind everything else happening in AI right now also made speech-to-text good enough that you can speak at full conversational speed, with your actual accent, and get usable output. The latency dropped to about a second.

The theoretical speed advantage of voice was always there: people speak at roughly 180 words per minute and type at about 45. But a 4x speed advantage only matters when accuracy is high enough that you don't spend the time savings on corrections. Most people can tell you when dictation started understanding them. The more interesting question is when they started trusting it enough to hit send without re-reading. Those are very different moments, and only the second one changes behavior.

But even after accuracy got good, most people who tried dictation still stopped using it. The reason was how the output looked.

A raw transcript has no punctuation, no paragraph breaks, no capitalization, no structure. It reads like someone dumped their stream of consciousness into a text field. And people are surprisingly sensitive to this; we judge the quality of someone's thinking by how their writing looks, even when we know that's irrational. Nobody wants to send an email that reads like they dictated it in a parking lot. The formatting problem was a social problem as much as a technical one.

LLMs solved this too, applied differently: taking the raw transcript and adding punctuation, paragraph structure, and capitalization without rewriting the words. The key is that the model formats what you said rather than replacing it with what it would have said.
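To make that concrete, here is a minimal sketch of what a formatting pass like this can look like. It is not Epilude's implementation: the OpenAI client, the model name, and the prompt wording are assumptions for illustration. The only essential part is the instruction to add structure without changing the speaker's words.

```python
# Hypothetical sketch of an LLM formatting pass: punctuation, capitalization,
# and paragraph breaks are added, but the speaker's words are left alone.
# Assumes the `openai` Python package (>=1.0) and an API key in the environment;
# the model name and prompt are illustrative, not Epilude's actual stack.
from openai import OpenAI

FORMATTING_INSTRUCTIONS = (
    "You clean up dictated text. Add punctuation, capitalization, and "
    "paragraph breaks. Do not rephrase, reorder, add, or remove words."
)

client = OpenAI()

def format_transcript(raw_transcript: str) -> str:
    """Return the transcript with structure added but the wording unchanged."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any fast instruction-following model
        temperature=0,        # keep the output deterministic and conservative
        messages=[
            {"role": "system", "content": FORMATTING_INSTRUCTIONS},
            {"role": "user", "content": raw_transcript},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    raw = ("hey can we push the review to thursday i still need feedback from "
           "legal also did you see the draft i sent on monday")
    print(format_transcript(raw))
```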

Both breakthroughs happened almost simultaneously: better transcription and better formatting of the output. That's why dictation seemed to go from a punchline to something people actually use overnight.

With Epilude, we've been building a dictation tool that does both: it transcribes at conversational speed and formats the output before it hits the text field. The whole pipeline takes about a second. We noticed something after a few weeks of using it: we'd stopped letting emails sit. Messages we would have read and flagged for later, we were just answering, because the effort of responding had dropped below whatever threshold had been stopping us before. We were writing longer replies, too. Not because we suddenly had more to say, but because we'd always had more to say and typing it had felt like too much work.

The future that was always five years away may have quietly arrived. Most people haven't noticed yet. When they do, we think we'll find that people have a lot more to say than they were willing to type.


Epilude Team · 3 min read · dictation, voice, ai, opinion