Compiling a Massive, Multilingual Dictionary via Probabilistic Inference

Can we automatically compose a large set of Wiktionaries and translation dictionaries to yield a massive, multilingual dictionary whose coverage is substantially greater than that of any of its constituent dictionaries? Composing multiple translation dictionaries leads to a transitive inference problem: if word A translates to word B, which in turn translates to word C, what is the probability that C is a translation of A? The paper introduces a novel algorithm that solves this problem for 10,000,000 words in more than 1,000 languages. The algorithm yields PanDictionary, a novel multilingual dictionary. PanDictionary contains more than four times as many translations as the largest Wiktionary at precision 0.90, and over 200,000,000 pairwise translations in over 200,000 language pairs at precision 0.8.
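To make the transitive inference question concrete, here is a minimal sketch of one way to estimate the probability that C is a translation of A. It is not the paper's algorithm. It assumes a hypothetical translation graph in which each dictionary entry is an undirected edge with an independent probability of being correct, and it estimates by random sampling how often A and C end up connected. The function names, the example words, and the edge probabilities are all illustrative assumptions.

```python
import random
from collections import defaultdict


def sample_connected(edges, source, target, rng):
    """Keep each edge with its own probability, then test reachability."""
    adj = defaultdict(list)
    for (u, v), p in edges.items():
        if rng.random() < p:  # edge survives in this sampled graph
            adj[u].append(v)
            adj[v].append(u)
    # Depth-first search from source over the surviving edges.
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def estimate_translation_probability(edges, source, target, samples=10000, seed=0):
    """Monte Carlo estimate of P(source and target are connected)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = sum(sample_connected(edges, source, target, rng) for _ in range(samples))
    return hits / samples


# Hypothetical entries: (word, word) -> assumed probability the entry is correct.
edges = {
    ("en:spring", "fr:printemps"): 0.9,
    ("fr:printemps", "de:Frühling"): 0.8,
    ("en:spring", "es:primavera"): 0.9,
    ("es:primavera", "de:Frühling"): 0.9,
}
estimate = estimate_translation_probability(edges, "en:spring", "de:Frühling")
```

With a single path A-B-C the estimate approaches the product of the two edge probabilities; additional independent paths raise it. Plain reachability treats every word as having one meaning, so it overestimates whenever an intermediate word is ambiguous (for example, "spring" as a season versus a coil). Handling such ambiguity would need extra modeling that this sketch leaves out.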
