Panlingual lexical translation via probabilistic inference

The bare minimum lexical resource required to translate between a pair of languages is a translation dictionary. Unfortunately, dictionaries exist for only a tiny fraction of the 49 million possible language pairs, making machine translation virtually impossible between most languages. This paper summarizes the last four years of our research, motivated by the vision of panlingual communication. Our research comprises three key steps. First, we compile over 630 freely available dictionaries from the Web and convert this data into a single representation, the translation graph. Second, we build several inference algorithms that infer translations between word pairs even when no dictionary lists them as translations. Finally, we run our inference procedure offline to construct PANDICTIONARY, a sense-distinguished, massively multilingual dictionary that has translations in more than 1000 languages. Our experiments assess the quality of this dictionary and find that, at a high precision of 0.9, it contains four times as many translations as the English Wiktionary, the lexical resource closest to PANDICTIONARY.
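
To make the first two steps concrete, the following is a minimal sketch, not the paper's actual algorithm. It assumes a translation graph whose nodes are hypothetical (language, word) pairs and whose edges carry an illustrative probability that a dictionary entry is a correct translation. It then estimates how likely two words with no direct dictionary entry are to be translations by sampling random subgraphs and checking whether the two words remain connected. The function names, the edge probabilities, and the sampling scheme are all assumptions made for illustration, and the sketch ignores word senses, which PANDICTIONARY distinguishes.

```python
import random
from collections import defaultdict


def build_translation_graph(entries):
    """Build an undirected graph from dictionary entries.

    entries: iterable of (node_a, node_b, prob), where each node is a
    hypothetical (language, word) tuple and prob is an assumed
    probability that the entry is a correct translation.
    """
    graph = defaultdict(dict)
    for a, b, prob in entries:
        graph[a][b] = prob
        graph[b][a] = prob
    return graph


def estimate_translation_prob(graph, source, target, samples=2000, seed=0):
    """Monte Carlo estimate of the chance that source and target connect.

    In each sample, every edge is kept independently with its
    probability; the estimate is the fraction of samples in which
    target is reachable from source.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        seen = {source}
        stack = [source]
        found = False
        while stack:
            node = stack.pop()
            if node == target:
                found = True
                break
            for nbr, prob in graph[node].items():
                # Each edge is examined at most once per sample, because
                # it is skipped as soon as its far endpoint has been seen.
                if nbr not in seen and rng.random() < prob:
                    seen.add(nbr)
                    stack.append(nbr)
        hits += found
    return hits / samples


# Illustrative entries only: no dictionary here links English to German directly.
entries = [
    (("en", "spring"), ("fr", "printemps"), 0.5),
    (("fr", "printemps"), ("de", "Frühling"), 0.5),
]
graph = build_translation_graph(entries)
score = estimate_translation_prob(graph, ("en", "spring"), ("de", "Frühling"))
print(f"estimated translation probability: {score:.2f}")
```

With a single two-edge path, the estimate approaches the product of the two edge probabilities; additional independent paths between the same words would raise it. A sense-aware method would also need to guard against paths that pass through a polysemous word and drift to a different meaning, which this sketch does not attempt.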
