Microsoft built technology that's better than a human at understanding a conversation

The Microsoft Research team responsible for setting the new milestone. Back row, left to right: Wayne Xiong, Geoffrey Zweig, Frank Seide. Front row: Xuedong Huang, Dong Yu, Mike Seltzer, Jasha Droppo and Andreas Stolcke. (Photo: Dan DeLong)

In December 2015, Microsoft Chief Scientist of Speech Xuedong Huang told Business Insider that “in the next four to five years, computers will be as good as humans” at understanding the words that come out of your mouth.

Less than a year later, Microsoft has set a record, announcing a system that can transcribe the contents of a phone call with “the same or fewer errors” than human professionals trained in transcription.

It’s a huge milestone for speech recognition, coming just as gadgets like Amazon Echo and Apple’s AirPods prove that voice is going to play a big role in the future of technology. And by Huang’s standard, that’s mission accomplished.

“We were able to move more quickly than we anticipated” thanks to advancements in artificial intelligence and acoustic technology, Microsoft Principal Researcher Geoffrey Zweig tells Business Insider, and “we were able to get here faster.”

Switchboard test

Back in the 1990s, the National Institute of Standards and Technology (NIST) released a large set of recorded phone conversations in English, Spanish, and Mandarin, called “Switchboard,” as a way to keep things fair for the field of speech recognition research. Everybody works from the same data, so nobody can cheat.

Since then, lots of companies, including IBM, Google, and Microsoft itself, have used the Switchboard test as one of the main ways to check the accuracy of their speech recognition software.

A phone call is a great test because, as in real life, people mumble, mutter, cough, and otherwise stumble over their words, making automatic transcription a “much more difficult task” than it would be under laboratory conditions, Zweig says.

Microsoft Distinguished Engineer Xuedong Huang. (Photo: Microsoft)

Back in September, Huang announced via blog entry that Microsoft Research had achieved an error rate on the Switchboard test of 6.3%. He said Microsoft’s error rate was believed to be the best in the whole industry, and only a hair above the 5.9% average error rate among professional transcribers.
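For context, error rates on tests like Switchboard are conventionally reported as word error rate: the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The article does not spell out the scoring, so the Python sketch below is an illustration under that assumption. The function name and the example sentences are made up for this purpose, and it is not Microsoft's or NIST's actual scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration only; real scoring tools also
    normalize punctuation, casing, and filler words before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words (standard Levenshtein table,
    # kept one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    rate = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"{rate:.1%}")
```

In this toy example the transcript drops one of six reference words, so the rate is one in six, or roughly 17%. On that scale, the gap between 6.3% and 5.9% works out to about four fewer mistakes per thousand words.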

So, Microsoft made some tweaks to the model, and did what Zweig says nobody had ever done before: Took the Switchboard test and gave it to those professionals to transcribe, to compare the results.

Why had nobody taken that step before? Maybe because it was “beyond the imagination” that even the best systems could come close to matching a human, Zweig speculates. Regardless, the results came back and NIST verified them.

Microsoft had officially built a speech recognition system that was better than a human.

What’s next?

In the shorter term, this technology is going to make Microsoft’s Cortana virtual assistant much better at understanding you. In the long term, Zweig says, Microsoft is working hard at using this successful model and then tweaking it for more situations.

Right now, it’s optimized for listening in on a conversation on a nice, stable landline telephone. With the core speech recognition algorithms all stable, the team can now tweak the system to better understand you when you’re on a noisy city street, in an echo-y conference room, or even at a McDonald’s drive-thru.

And the more people use it in all these situations, the better it gets for everyone, Zweig says, as the algorithms learn and improve.

“This is a technology that’s constantly improving,” Zweig says.

Microsoft Cortana on Android. (Image: YouTube/Microsoft)

And in general, this science is a huge and important step forward as speech recognition becomes ever more important to the future of technology. With the ability to understand the words coming out of your mouth, it’s a solid foundation on which to build better, smarter artificial intelligence that can find the context around the words.

“We’ve actually managed to advance the technology of speech recognition,” says Zweig.
