Monday, May 12, 2008

A billion monkeys

A trillion words, a billion five-word sequences, 13 million unique words each appearing over 40 times! This is what a Google research team has accumulated in its project on n-grams.

Several aspects of this project had occurred to me before, and I can see further areas into which the research could be extended.

Google has been trawling the web for the fabric of which it is made - words - looking for the frequency with which words occur and, perhaps more interestingly, combinations of words.

This has many obvious uses - in particular in automatic error detection and in translation. It's a crude method, but if a machine can learn the structure of a language from a huge database of its current use, the chances of more natural translations increase greatly.
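As a minimal sketch of the principle - not Google's actual pipeline, and with a toy corpus standing in for a trillion words of web text - relative n-gram frequencies can flag word sequences that are improbable in practice:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield successive n-token tuples from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# A toy corpus standing in for a huge database of real-world text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(ngrams(corpus, 2))

def plausibility(w1, w2):
    """Rough score: how often w2 follows w1, relative to all
    occurrences of w1. Zero means the pair was never observed."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(plausibility("cat", "sat"))  # 0.5 - a seen, plausible continuation
print(plausibility("cat", "mat"))  # 0.0 - never observed adjacent
```

With a trillion-word corpus instead of one sentence, a score near zero for a pair a writer has typed is a decent hint of an error, and a translator choosing between candidate phrasings can prefer the one whose n-grams score higher.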

One idea I had while learning Chinese in Beijing came from the fact that while it's very simple to buy box sets of 'the thousand most common Chinese characters', due to the structure of the language what would actually be most useful would be lists of the most common character pairs, and of four-character combinations. The meaning usually comes from pairs of characters, where each character individually gives only a vague idea of the meaning. This I would love to have, and presumably it could quite easily be generated from an appropriate n-gram database, as sketched below. (If anyone already knows details of such a database, please tell me!)
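I don't know of a published resource that does this, but as a sketch of the idea: counting adjacent character pairs over a body of Chinese text (a toy three-sentence corpus here, purely for illustration) would yield exactly such a frequency list:

```python
from collections import Counter

def char_pairs(text):
    """Yield adjacent character pairs, keeping only pairs where both
    characters are CJK ideographs, so punctuation is skipped."""
    for a, b in zip(text, text[1:]):
        if '\u4e00' <= a <= '\u9fff' and '\u4e00' <= b <= '\u9fff':
            yield a + b

# A toy corpus; a real list would need a large body of Chinese text.
corpus = "我们学习中文。中文很有意思。我们在北京学习。"

pair_counts = Counter(char_pairs(corpus))

# The most frequent pairs are the candidates for a learner's list.
for pair, count in pair_counts.most_common(10):
    print(pair, count)
```

The same loop extended to windows of four characters would give the four-character combinations too; the only hard part is gathering enough text, which is precisely what an n-gram database already provides.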

And being Google, all of this data - around 24 gigabytes, with the billions of combinations they've found - is available for free, which I can imagine will be of great use to linguistics departments around the world.

(*Note that the Google post shows that this news is not new, coming as it does from 2006*)

Thanks to the Google Operating System blog.
