Learning Bilingual Lexicons from Monolingual Corpora

We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.

[1]  Helmut Schmidt,et al.  Probabilistic part-of-speech tagging using decision trees , 1994 .

[2]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[3]  Pascale Fung,et al.  Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus , 1995, VLC@ACL.

[4]  Philipp Koehn,et al.  Learning a Translation Lexicon from Monolingual Corpora , 2002, ACL 2002.

[5]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[6]  Leslie G. Valiant,et al.  The Complexity of Computing the Permanent , 1979, Theor. Comput. Sci..

[7]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[8]  Michael I. Jordan,et al.  A Probabilistic Interpretation of Canonical Correlation Analysis , 2005 .

[9]  Philipp Koehn,et al.  Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models , 2004, AMTA.

[10]  Daniel Marcu,et al.  Scalable Inference and Training of Context-Rich Syntactic Translation Models , 2006, ACL.

[11]  Reinhard Rapp,et al.  Identifying Word Translations in Non-Parallel Texts , 1995, ACL.

[12]  Daniel Jurafsky,et al.  Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks , 2004, NAACL.

[13]  Noah A. Smith,et al.  Annealing Structural Bias in Multilingual Weighted Grammar Induction , 2006, ACL.

[14]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[15]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[16]  Dan Klein,et al.  Agreement-Based Learning , 2007, NIPS.