As frustrating as it is when voice-controlled tools like Apple's Siri and Microsoft's Cortana don't understand you, it used to be a lot worse.
Earlier this year, Google announced it had gotten its speech recognition error rate down to 8%.
Microsoft Distinguished Engineer and Chief Scientist of Speech Xuedong Huang says that figure represents a vast improvement.
When Microsoft made its first-ever speech recognition technology available alongside Windows 95, a project Huang headed up, the error rate was “almost 100%,” he says.
If you chart it out, Huang says, that means that on average, speech recognition has gotten 20% better every single year for the last twenty years, which means the end is in sight.
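Read literally, "20% better every single year" describes a constant relative reduction in the error rate, which compounds multiplicatively. Here's a minimal sketch of that arithmetic; the starting error rate, the exact annual figure, and the averaging method are illustrative assumptions based only on the rough numbers in the article:

```python
def error_after(start_error: float, annual_reduction: float, years: int) -> float:
    """Error rate after `years` of a constant relative reduction per year."""
    return start_error * (1.0 - annual_reduction) ** years

# A 20% relative reduction each year, starting from an error rate near 100%:
print(round(error_after(1.0, 0.20, 20), 4))  # -> 0.0115

# Conversely, the constant annual reduction implied by going from ~100%
# to 8% over twenty years:
implied = 1.0 - (0.08 / 1.0) ** (1.0 / 20.0)
print(round(implied, 3))  # -> 0.119
```

The two readings don't line up exactly (a constant 20% annual cut would land well below 8%), so the figure is best taken as a back-of-the-envelope average rather than an exact rate.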
“In the next four to five years, computers will be as good as humans” at understanding the words that come out of your mouth, Huang says.
But for Huang, Microsoft, and the tech world in general, the end of this road is the beginning of the next phase: building real artificial intelligence.
With “total parity” between human and computer understanding on the visible horizon, Huang says, the world of speech science has a firmer foundation on which to build computers with actual artificial intelligence.
“To understand a word is easier than understanding the context,” Huang says.
But with tools like Microsoft Cortana, Google Now, Apple Siri, and Amazon Alexa, we have consumer-facing apps that are slowly but surely getting better at figuring out not only what you said — but also what you meant. It means that you can start to have more complex conversations with your gadgets.
This means that we’re on the cusp of an “invisible revolution,” Huang says, as speech becomes an accepted and useful interface for computers, and artificial intelligence becomes a reality.
It’s something that’s been a long time coming for Microsoft: at the 2001 Consumer Electronics Show, Bill Gates demonstrated the Microsoft MiPad (seriously), a device running prototype voice recognition software created by Huang’s team.
The MiPad never came to market. But the world of speech technology marches on.
The world of tomorrow
Huang is heavily involved with Project Oxford, the company’s set of machine-learning tools for image and speech recognition. If you’ve played with fun Microsoft sites like How-Old.net or MyMoustache, you’ve gotten some hands-on experience with what it can do.
Project Oxford is available to developers everywhere, so they can build the same technology into their own apps.
And just as Microsoft Cortana can listen to your spoken questions and give smart answers, Project Oxford lets developers of consumer apps, business software, and everything in between build technology with which you can hold a decent conversation.
It means the rise of one interface — speech — that can control every kind of device, anywhere in a home. And with Microsoft Project Oxford, and the Microsoft Azure cloud that underpins it, Huang says Microsoft is in a great position to be at the center of the revolution.
“It took us 20 years to reach that goal,” Huang says.
Even as Microsoft works on AI, the company has already started to look ahead to what’s next, Huang says.
Indeed, he says that the Xbox Kinect sensor, which let people control Xbox video games with voice and motion, was actually born of Microsoft Research’s first cracks at building systems that could understand both speech and gestures to discern meaning.
Eventually, this will just be a new normal, Huang believes. Kids will grow up with these kinds of artificial intelligence systems, and they will be taken for granted as standard methods for interacting with technology.
“We are creating a new generation,” Huang says.