Today, you can translate almost any word, phrase or document by simply plugging it into a search engine.
Microsoft, however, is envisioning a future where you’ll be able to hold a conversation with anyone around the world without the obstacle of language barriers.
At Re/code’s inaugural Code Conference, Microsoft unveiled its real-time speech translator for Skype — a technology, years in the works, that conjures up references to “Star Trek” and “The Hitchhiker’s Guide to the Galaxy.”
The demo made the technology look natural and fluid — simply speak as you would in normal conversation, and the person on the other side would hear your words followed by a clean translation in their preferred language.
The “Star Trek”-like translator will become available before the end of 2014.
Vikram Dendi, a technical and strategy advisor for Microsoft Research, was brought on to the team five years ago specifically to work on translation technology.
We spoke with Dendi to learn a little more about how Microsoft created its real-time translator. Here’s the lightly edited Q&A.
Business Insider: When Microsoft’s CEO Satya Nadella talked about Skype’s real-time translator on stage, he emphasised how humanistic the technology is. What has Microsoft done to make the translations seem natural and conversational?
Vikram Dendi: So one of the early realisations for us as we were investing in translation was that it was really important that we don’t think like computers.
It was really important that we think of it as a human communication problem. While we were doing very cutting edge work on the computer science side, we were also looking very closely into how people communicated with each other.
I used to spend a lot of time going to a number of different countries where I’d interact with translators. And I came to the realisation that no two translators agree 100 per cent on how best to translate something. If you give the same text to two different translators, there will be variation in how they translate it.
Ultimately the end goal is to really create an understanding. So a lot of the work that we have done in our translation engine was really around creating a lot of flexibility and customisability.
For example we have something called a Translator Hub. The Translator Hub allows you to bring in the content that is representative of the type of content that you’re going to translate. And then we’d combine that knowledge with the knowledge of the web that the translator tool already has from Bing’s web index.
BI: What’s an example of how we’d see this Translator Hub work in real life?
VD: Let’s say you’re doing e-commerce and you have a fashion audience shopping with you. And you want to have somebody in Russia buying from someone in England, and they’re selling some fashion accessory. Now, there is a particular type of lingo inside fashion. An untrained engine is essentially trying to say: no matter what kind of content you throw at me, I’m going to give you a reasonable answer. Train it on your fashion content, and it can do much better with that lingo.
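To make the Hub idea concrete, here is a minimal Python sketch of domain-adapted translation: a tiny glossary built from a seller’s own content overrides the sense a general web-trained engine would pick. The dictionaries, the word choices and the French senses here are invented for illustration; they are not Microsoft’s actual data or API.

```python
# General model: the sense a web-trained engine might pick by default.
GENERAL = {"jumper": "sauteur"}   # "one who jumps"

# Domain model: the sense learned from the fashion content a seller
# uploaded to the (hypothetical) hub.
FASHION = {"jumper": "pull"}      # the garment

def translate(word, domain=None):
    """Prefer the domain-trained sense; fall back to the general one."""
    if domain and word in domain:
        return domain[word]
    return GENERAL.get(word, word)

print(translate("jumper"))           # general sense: "sauteur"
print(translate("jumper", FASHION))  # fashion sense: "pull"
```

A real system blends statistics from both models rather than doing a hard lookup, but the priority order — domain knowledge first, web knowledge as the fallback — is the same idea.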
We’ve also done some other fun stuff with it. We’ve had the same Translator Hub used by the Klingon Language Institute and Paramount Pictures to create the translation system for Klingon. It was more of a pop culture moment, but if you spend any time in the Klingon language community you realise how serious they are about preserving this language.
BI: Google showed off Conversation Mode for Google Translate a while back, which is similar to the real-time translator tool for Skype. How is Microsoft’s technology better than or different from Google’s?
VD: There have been a lot of attempts over the past few years at combining text translation technology with speech recognition. What’s really different about this is the idea of moving away from a focus on the tech and putting the focus on what the human problem really is. That was very important. And Skype was very critical for that.
The way I think about this is, imagine that it’s like a wall. So imagine if there was a wall between these two communities of people, and you had to take it down. So in the beginning it could start with someone scribbling something and throwing it over the wall to the other side. And that’s a step in the right direction, people are able to communicate.
Then you chip away and chip away against that wall and eventually you make a little hole so that you could see that person on the other side. Eventually you remove this wall and people are able to speak freely. And the important thing is that when that wall is gone, no one even really remembers the axe that was used to bring it down.
And you could just daisy-chain speech recognition with machine translation, but one of the biggest challenges you’d have is fluency. The “umms” and the “ahhs” that pepper your conversation are things that, when I say them to you, you don’t even notice, because your brain automatically eliminates them. Those are the types of things that are hard for a speech recogniser to deal with.
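The simplest version of the clean-up Dendi describes can be sketched in a few lines of Python: drop filler tokens from the raw transcript before the translation step. The filler list and the example sentence are invented for illustration — a production recogniser models disfluencies statistically rather than with a hand-written list.

```python
# Illustrative filler tokens; a real recogniser learns these from data.
FILLERS = {"um", "umm", "uh", "ah", "ahh", "er"}

def strip_disfluencies(transcript):
    """Drop simple filler words from a raw speech transcript before
    handing it to a machine-translation engine."""
    kept = [w for w in transcript.split()
            if w.lower().strip(",.!?") not in FILLERS]
    return " ".join(kept)

print(strip_disfluencies("So, umm, I was thinking, uh, we could meet tomorrow"))
# -> "So, I was thinking, we could meet tomorrow"
```

Word-by-word filtering like this misses multi-word fillers (“you know”) and repairs (“I went — I mean, we went”), which is part of why the real problem is harder than a daisy chain.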
BI: On stage, Nadella also touched on the subject of transfer learning — the idea that Skype’s translator gets better at the languages it already knows once it learns a new language. Can you explain how this works?
VD: One of the key breakthroughs that happened in speech, and without that breakthrough we’d still be far away from getting this to the user, is the use of deep neural networks.
It’s a type of machine learning that a lot of people were initially sceptical about. But a collaboration between Microsoft researchers and the University of Toronto proved that this really substantially improved speech recognition capabilities.
Studies have shown that someone who has known more than one language as they were growing up could pick up a third language quicker than someone who’s learning a language from scratch.
Deep neural networks are really modelled after neurons in the brain. In a way, they’re an attempt at approximating how our brains work. When the system learns to recognise French, somehow the English recognition also improves. Add Chinese data, and both French and English improve. So it’s very fascinating.
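The structural reason one language’s training can move another’s accuracy is that the languages share parameters. Here is a deliberately tiny Python toy — one shared number standing in for the shared hidden layers, plus a small head per language — that only illustrates the sharing; the numbers and the “gradient step” are invented.

```python
# Toy model: a single shared parameter feeds every language's head,
# the way shared hidden layers feed per-language outputs in a deep net.
shared = 1.0
heads = {"en": 0.5, "fr": -0.3}

def score(x, lang):
    # Shared representation first, then a language-specific read-out.
    return heads[lang] * (shared * x)

before = score(2.0, "en")   # English score before any "French training"
shared += 0.1               # stand-in for a gradient step on French data
after = score(2.0, "en")    # the English path moved too

print(before, after)
```

Because the French update touched the shared part, the English score changed even though the English head was never trained — the toy analogue of the transfer effect Nadella described.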
BI: Creating an accurate real-time speech translator seems difficult. Even text translators today like Google Translate don’t always make sense. What are some of the biggest challenges Microsoft faced in creating this tool for Skype?
VD: In order for this thing to be really useful, you need to provide more flexibility. So that helps.
The other thing that’s really interesting is that the technology is also used on various social media content. When we first started down that path, the translations didn’t make sense in a lot of cases. Social media is such a different language.
Even if you were speaking English, there are all these abbreviations and the lingo is so different. To put it in context, the way we teach a language model to understand a language is by feeding it a lot of high-quality translations. So when you try to translate something new, it expects well-formed sentences, because that’s how it learned.
When you feed it whatever the social media language is, it doesn’t understand what it is or how to interpret it. A scientist on our team came up with a really brilliant way of transforming the social media language into more proper language, then feeding that into the translation system, which made it more capable and more intelligent.
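A minimal sketch of that normalisation step, in Python: rewrite shorthand into well-formed words before the translation model sees the text. The lookup table and the example are invented for illustration — the researchers’ actual mapping was learned from data, not hand-written.

```python
# Illustrative shorthand table; a production system learns these
# rewrites rather than hard-coding them.
NORMALISATIONS = {"u": "you", "r": "are", "gr8": "great",
                  "thx": "thanks", "b4": "before"}

def normalise(text):
    """Rewrite social-media shorthand into well-formed words so a model
    trained on clean text has a chance of translating it."""
    return " ".join(NORMALISATIONS.get(w.lower(), w) for w in text.split())

print(normalise("thx u r gr8"))  # -> "thanks you are great"
```

Once the input looks like the well-formed sentences the model was trained on, the ordinary translation pipeline can take over unchanged.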
And I’ve been asked many times what the perfect translation is. What I realised at that moment is that we could spend 10 years perfecting the universal translator. But we underestimate the ability of two people who want to communicate with each other.
Even if it doesn’t always get it right, people are willing to work with it.