Yesterday's New York Times article by Miguel Helft, "Google’s Computing Power Refines Translation Tool," presented Google's strategy for automated, online language translation. Google wins because it has both the data and the horsepower to make statistical approaches to automated translation effective. Right now, Google Translate accommodates 52 languages, and it will translate between any pair of them, so you can, for example, translate Estonian into Vietnamese. The results aren't perfect, but they're much better than where we were even a few years ago. Here's some text from kaunis.ee translated from Estonian to Vietnamese to English using Google Translate:
- Estonian: Kõik me soovime olla kaunid, olla ilusad nii sisemiselt kui väliselt. Nii sisemine kui väline ilu on mõlemad ühtmoodi olulised ning mõlema eest tuleb hoolitseda. Lõhn, jumestus ja riietus moodustavad isikupärase stiili, sinu välise ilu. Sinu ilusse soovime anda omapoolse panuse ka meie, olles abiks lemmiklõhnade ja kosmeetikatoodete soetamisel.
- Vietnamese: Tất cả chúng tôi muốn được đẹp, phải đẹp cả trong nội bộ và bên ngoài. Cả hai vẻ đẹp bên trong và bên ngoài là cả hai đều quan trọng và cần cả hai để chăm sóc. Mùi, trang điểm và ăn mặc đến phong cách cá nhân, vẻ đẹp bên ngoài của bạn. Vẻ đẹp của bạn muốn đóng góp cho chúng tôi cũng đang được và giúp lemmiklõhnade mua mỹ phẩm.
- English: We all want to be beautiful, be beautiful both internally and externally. Both beautiful inside and outside is both important and both need to care. Smell, makeup and dress to personal style, the beauty outside of you. The beauty of you want to contribute to our well being and help lemmiklohnade buy cosmetics.
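To make "any pair of those 52 languages" concrete, here is a quick back-of-the-envelope count in Python. It is only a sketch: it assumes every language can be translated into every other one and that direction matters (Estonian to Vietnamese is counted separately from Vietnamese to Estonian). The variable names are mine, not anything from Google.

```python
from itertools import permutations
from math import comb

NUM_LANGUAGES = 52  # figure quoted above for Google Translate

# Directed pairs: source and target are distinct, and order matters.
directed_pairs = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

# Undirected pairs: each two-language combination counted once.
undirected_pairs = comb(NUM_LANGUAGES, 2)

# Sanity check on the three languages used in the sample above.
sample = ["Estonian", "Vietnamese", "English"]
sample_pairs = list(permutations(sample, 2))

print(directed_pairs, undirected_pairs, len(sample_pairs))
```

So 52 languages means 2,652 translation directions, or 1,326 combinations if you ignore direction.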
To put that 52 number in perspective, the top 20 languages in the world account for 5.1 billion speakers, or roughly 75% of the world population (calculation from WolframAlpha). On the other hand, 94% of the world's languages are spoken by only 6% of the world's population. Communication and information access are important goals of translation, but so are cultural understanding and language preservation.
My interest in smaller languages comes primarily from my experience with Mongolian. There are only about six million speakers of the Mongolian macrolanguage, primarily found in Mongolia and Inner Mongolia. So it's not a tiny language and it's not in any imminent danger of disappearing. Still, automated translation tools would help communication and information access in both directions. For 99.91% of the world, sites like Olloo.mn are incomprehensible. And for most Mongolians, sites like English Wikipedia are incomprehensible. (There is a Mongolian version of Wikipedia, with about 2% of the article count of English Wikipedia, but the quality generally isn't as good because of a lower number of active users.)
What was most interesting to me from the Times article was Google's plan for small languages, covered in more detail on Helft's blog this morning. Google has created something called the Google Translator Toolkit and is encouraging individuals to upload translations from other languages to its system. For example:
Mr. [Te Taka] Keegan uses a tool called the Google Translator Toolkit to upload Maori translations of English texts to Google. Others can then use those translations in their work, increasing the quantity and quality of Maori translations that are available, and creating incentives for children of Maori descent to learn the language.
What Google has lacked to date for smaller languages is data. Given enough participation, this may solve that problem for languages like Maori and Mongolian.
Mongolian, of course, suffers from another problem in practice, which is that it is written in multiple scripts: Cyrillic, Mongolian script, and Mongolized Latin. Mongolized Latin is especially prominent in chat and text messaging, and is not at all standardized. As I wrote in the language note for my dissertation: "The word бөх (wrestler) can be romanized at least nine different ways (b{u,ø,o}{h,kh,x}), some resulting in ambiguity of meaning (e.g., “bull”, “all”, and if you’re not careful, “gum”)."
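The pattern b{u,ø,o}{h,kh,x} quoted above is easy to expand mechanically, which gives a feel for what a search or translation tool would have to cope with. The Python below is a minimal sketch that only enumerates the spellings from that one pattern; the reverse-lookup table is a hypothetical illustration of one way to map them back to Cyrillic, not a description of how any existing tool works.

```python
from itertools import product

# Alternatives taken directly from the pattern b{u,ø,o}{h,kh,x}.
VOWELS = ["u", "ø", "o"]    # Latin spellings seen for Cyrillic ө
FINALS = ["h", "kh", "x"]   # Latin spellings seen for Cyrillic х


def romanizations(initial="b", vowels=VOWELS, finals=FINALS):
    """Return every spelling the pattern allows, in a stable order."""
    return [initial + v + f for v, f in product(vowels, finals)]


variants = romanizations()

# Hypothetical reverse lookup: each Latin spelling points back to бөх.
# A real table would need many-to-many entries, since some of these
# spellings also match other Mongolian words.
to_cyrillic = {spelling: "бөх" for spelling in variants}

print(len(variants))
print(", ".join(variants))
```

Three vowel choices times three final-consonant choices gives the nine spellings, and that is for a single three-letter word.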
Yamli is apparently solving this problem for Arabic, by incorporating Arabizi-Arabic translation in its search. For now, one of the best Mongolian translation tools out there remains an "old-fashioned" one, Bolor Toli, an online Mongolian-English-German dictionary.
(The reference in the title is to this Babel fish.)