A New Player in the Machine Translation Space

Coming in on the GO Train the other morning, I was flipping through the latest issue of Wired page by page, trying to find anything I'd missed. Well, it turns out that all this month I've been missing an article about Meaningful Machines, a new-ish startup that just emerged from stealth mode.

Meaningful Machines has been building a statistical machine translation system and is essentially a competitor to our technology partner, Language Weaver (who I'm glad to see got at least some mention in the article).

The article is worth a read if you're not familiar with how statistical machine translation (SMT) works, as the author does a pretty good job of explaining it.

Their software certainly sounds interesting, but they're pretty candid that they face the same challenge as Google: it takes an enormous amount of processing power to translate anything. From what I've heard, Google's top BLEU scores are the result of thousands of servers grinding away. In this realm, more servers/processors = better results. Even with many servers, Meaningful's system averages about 10 seconds per word!

The most interesting aspect of this software is that it doesn't require parallel corpora (aligned bodies of content in two languages). Instead, they use a massive bilingual dictionary and then compare the translation, in five- to eight-word chunks, against a massive database of content in the target language:

Given a passage to translate from Spanish, the system looks at each sentence in consecutive five- to eight-word chunks. Using the dictionary, the software employs a process called flooding to generate and store all possible English translations for the words in that chunk.

Making this work effectively requires a dictionary that includes all of the possible conjugations and variations for every word.

The options spit out by the dictionary for each chunk of text can number in the thousands, many of which are gibberish. To determine the most coherent candidates, the system scans the 150 Gbytes of English text, ranking candidates by how many times they appear. The more often they’ve actually been used by an English speaker, the more likely they are to be a correct translation.
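To make the flooding-and-ranking idea concrete, here's a toy sketch of how I picture it working. Everything here is made up for illustration: the tiny dictionary, the phrase counts standing in for the 150 GB English corpus, and the detail of permuting word order are my own assumptions, not Meaningful Machines' actual implementation.

```python
# Toy sketch of "flooding": generate every candidate translation of a
# chunk from a bilingual dictionary, then rank candidates by how often
# they appear in a target-language corpus. All data here is invented.
from itertools import permutations, product

# Hypothetical bilingual dictionary: each Spanish word maps to its
# possible English renderings (a real one would include every inflection).
DICTIONARY = {
    "el": ["the"],
    "gato": ["cat", "tomcat"],
    "negro": ["black", "dark"],
}

# Stand-in for the giant English corpus: phrase -> occurrence count.
CORPUS_COUNTS = {
    "the black cat": 950,
    "the cat black": 2,
    "the dark cat": 40,
    "the dark tomcat": 1,
}

def flood(chunk):
    """All word-choice combinations, in all orderings (a toy simplification).

    Many results are gibberish, just as the article describes.
    """
    choices = product(*(DICTIONARY[word] for word in chunk))
    return {" ".join(p) for words in choices for p in permutations(words)}

def rank(candidates):
    """Rank candidates by how often English speakers have actually used them."""
    return sorted(candidates, key=lambda c: CORPUS_COUNTS.get(c, 0), reverse=True)

chunk = ["el", "gato", "negro"]
print(rank(flood(chunk))[0])  # → the black cat
```

Even in this toy version you can see why the corpus side dominates the cost: every chunk fans out into a pile of candidates, each of which has to be counted against an enormous body of text.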

Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk. … Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk’s translation options and the ones before and after it.
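And here's a similarly hand-wavy sketch of the overlap-rescoring step. The scoring scheme (counting shared words with the neighbouring chunk's options) is purely my guess at the general idea; the article doesn't describe how their decoder actually scores overlap.

```python
# Toy sketch of rescoring a chunk's candidates by overlap with the
# previous (one-word-shifted) chunk's candidates. Invented scoring scheme.

def word_overlap(a, b):
    """Count distinct words shared between two candidate strings."""
    return len(set(a.split()) & set(b.split()))

def rescore(prev_candidates, current_candidates):
    """Favour candidates that agree with the neighbouring window's options."""
    score = {c: max(word_overlap(c, p) for p in prev_candidates)
             for c in current_candidates}
    return sorted(current_candidates, key=lambda c: score[c], reverse=True)

# Two overlapping windows, one word apart:
prev_opts = ["the black cat sat on", "the dark cat sat on"]
curr_opts = ["black cat sat on the mat", "dark tomcat seated over a rug"]
print(rescore(prev_opts, curr_opts)[0])  # → black cat sat on the mat
```

The intuition is that a translation which keeps reappearing as the window slides is probably right, while one that only works for a single chunk gets pushed down the list.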

One of the more interesting claims in the article is that cleaning up and polishing their system's output takes a translator half as long as translating from scratch. The author mentions, though, that he didn't actually see the results from their test with a translation agency. I can certainly believe they can achieve this in certain circumstances, but it would likely be out of reach of the average organization in the short term.

It doesn't appear they've actually released the software yet; they've simply come out of stealth mode. They'll certainly be an organization to watch down the road but, as the article mentions, they're going to be playing catch-up once they launch. Language Weaver has a huge head start, and they've also managed to get their system to the point where it can run effectively on a single server (though they'll be the first to admit more is always better).

It’s certainly going to be a fun space to watch over the next few years.