Move over America…

… China is about to have the world's largest online population. Between the end of 2006 and the end of 2007, China added roughly 73 million users to the Internet.


To put it in perspective: even if Canada doubled its population and put an internet connection in the house of every man, woman and child in the country, we'd still come up about 7 million people short.

On the flip side, though, there are two factors at work here. China's population is roughly 1.3 billion right now, which means a total user base of 210 million is only a 16% penetration rate. In Canada we have ~65% penetration and the US has ~71%.

India will no doubt pick up steam in the coming years and will definitely claim the number 2, if not the number 1, spot.

So what does this mean for the Internet in general?

The connected world’s borders are no longer geographical – they’re lingual.

The world may be flattening, but there are still a few big walls running across the landscape. The reality is the "hidden web" is going to keep growing. As I've posted about before, your ability to access information online revolves almost exclusively around the languages you can read and write.

As countries like China & India continue to pump new users online, more and more content will be generated in their native languages, likely invisible to you unless you speak (and search in) that language.

Google's getting better and better at opening access to these sites through their machine translation tools, but the reality is there just isn't enough CPU horsepower to run every Google search through machine translation for all the different language variations.

Language Weaver, through Kontrib, is also making an interesting attempt at opening up more content to a broader audience through a Digg-like portal. It's a great idea, although I think they're going to have a hard time getting the traction it needs. I'd personally love to see them work with Digg directly instead and create a licensing deal similar to what my friends at Idee have done with their image duplication detection technology.

It's going to be interesting to watch this story play out. Whoever busts the language barrier most effectively first will dramatically change the search game. Google is clearly out in front, and the most likely victor, but you never know who's running in stealth right now and could surprise us all.

A New Player in the Machine Translation Space

Coming in on the GO Train the other morning, I was flipping through the latest issue of Wired page by page, trying to find anything I missed. Well, it turns out that all this month I've been missing an article about Meaningful Machines, a new-ish startup that just emerged from stealth mode.

Meaningful Machines has been building a statistics-based machine translation system and is essentially a competitor to our technology partner, Language Weaver (who, I'm glad to see, got at least some mention in the article).

The article is worth a read if you're not familiar with how statistical machine translation (SMT) works, as the author does a pretty good job of explaining it.

Their software certainly sounds interesting, but they're pretty candid that they still face the same challenge as Google: it takes an enormous amount of processing power to translate anything. From what I've heard, Google's top BLEU scores are the result of thousands of servers grinding away. In this realm, more servers/processors = better results. Even with many servers, Meaningful's system averages about 10 seconds per word!
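If BLEU is new to you: it's the standard automatic metric for machine translation quality, scoring a candidate translation by its n-gram overlap with a human reference and penalizing candidates that are too short. Here's a minimal sketch of a simplified sentence-level version (real BLEU is computed over a whole test corpus, usually with smoothing; the example sentences are just mine):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. Real BLEU is
    corpus-level and typically smoothed."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # "Modified" precision: each reference n-gram can only be matched
        # as many times as it actually occurs in the reference.
        overlap = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))  # floor avoids log(0)
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return brevity * math.exp(sum(log_precisions) / max_n)

perfect = bleu("the black cat sleeps on the mat",
               "the black cat sleeps on the mat")  # exact match scores 1.0
partial = bleu("a cat sleeps on a mat",
               "the black cat sleeps on the mat")  # lower: missed words + too short
```

The "more servers = better results" point follows from how these systems train and decode: bigger language models and wider candidate searches both improve the n-gram statistics that BLEU rewards.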

The most interesting aspect of this software is that it doesn't require parallel corpora (aligned bodies of content in two languages). Instead, they use a massive bilingual dictionary and then compare the translation in 5-8 word chunks to a massive database of content in the target language:

Given a passage to translate from Spanish, the system looks at each sentence in consecutive five- to eight-word chunks. Using the dictionary, the software employs a process called flooding to generate and store all possible English translations for the words in that chunk.

Making this work effectively requires a dictionary that includes all of the possible conjugations and variations for every word.

The options spit out by the dictionary for each chunk of text can number in the thousands, many of which are gibberish. To determine the most coherent candidates, the system scans the 150 Gbytes of English text, ranking candidates by how many times they appear. The more often they’ve actually been used by an English speaker, the more likely they are to be a correct translation.

Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk. … Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk’s translation options and the ones before and after it.
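To make the mechanics concrete, here's a minimal Python sketch of the flood-and-rank idea as the article describes it. The dictionary and corpus here are toy data of my own invention, and the real system obviously also handles word reordering, morphology, and the chunk-overlap rescoring step:

```python
from itertools import product
from collections import Counter

# Toy stand-ins (purely illustrative) for the two resources the system needs:
# a bilingual dictionary and a monolingual target-language corpus.
DICTIONARY = {
    "el": ["the"],
    "gato": ["cat", "feline"],
    "duerme": ["sleeps", "slumbers"],
}

CORPUS = "the black cat sleeps . the black dog barks . a dark cat sleeps".split()

# Pre-count every 2- and 3-word sequence in the corpus -- the toy version of
# scanning 150 GB of English text.
NGRAM_COUNTS = Counter(
    " ".join(CORPUS[i:i + n])
    for n in (2, 3)
    for i in range(len(CORPUS) - n + 1)
)

def flood(chunk):
    """Generate every candidate translation for a chunk of source words."""
    options = [DICTIONARY[word] for word in chunk]
    return [" ".join(choice) for choice in product(*options)]

def rank(candidate):
    """Score a candidate by how often its word n-grams appear in the corpus."""
    words = candidate.split()
    return sum(
        NGRAM_COUNTS[" ".join(words[i:i + n])]
        for n in (2, 3)
        for i in range(len(words) - n + 1)
    )

chunk = ["el", "gato", "duerme"]
candidates = flood(chunk)          # 1 * 2 * 2 = 4 candidate translations
best = max(candidates, key=rank)   # "the cat sleeps" wins on corpus frequency
```

A real decoder would then slide the window one word to the right and rescore candidates by how much they overlap with the neighbouring chunks' options, which this sketch omits.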

One of the more interesting claims they make in the article is that their system's output takes half as long for a translator to clean up and polish as starting from scratch would. The author mentions, though, that he didn't actually see the results from their test in conjunction with a translation agency. I can certainly believe they can achieve this in certain circumstances, but it would likely be out of reach of the average organization in the short term.

It doesn't appear they've actually released the software yet; they've simply come out of stealth mode. They'll certainly be an organization to watch down the road, but, as the article mentions, they're going to be playing catch-up once they launch. Language Weaver has a huge head start, and they've also managed to get their system to the point where it can run effectively on a single server (though they'll be the first to admit more is always better).

It’s certainly going to be a fun space to watch over the next few years.