From Bilingual Dictionaries to Interlingual Document Representations

Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

[1]  Dragos Stefan Munteanu,et al.  Improving Machine Translation Performance by Exploiting Non-Parallel Corpora , 2005, CL.

[2]  Hal Daumé,et al.  Extracting Multilingual Topics from Unaligned Comparable Corpora , 2010, ECIR.

[3]  David K. Smith Network Flows: Theory, Algorithms, and Applications , 1994 .

[4]  Susan T. Dumais,et al.  Improving information retrieval using latent semantic indexing , 1988 .

[5]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[6]  W. Bruce Croft,et al.  Dictionary Methods for Cross-Lingual Information Retrieval , 1996, DEXA.

[7]  Dan Klein,et al.  Learning Bilingual Lexicons from Monolingual Corpora , 2008, ACL.

[8]  A. Volgenant,et al.  A shortest augmenting path algorithm for dense and sparse linear assignment problems , 1987, Computing.

[9]  Wei Gao,et al.  Using English information in non-English web search , 2008, iNEWS '08.

[10]  Kevin Knight,et al.  Name Translation in Statistical Machine Translation - Learning When to Transliterate , 2008, ACL.

[11]  Turid Hedlund,et al.  Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings , 2001, Information Retrieval.

[12]  Andrew McCallum,et al.  Polylingual Topic Models , 2009, EMNLP.

[13]  Min Zhang,et al.  Feature-Based Method for Document Alignment in Comparable News Corpora , 2009, EACL.

[14]  K. Saravanan,et al.  MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora , 2009, EACL.

[15]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[16]  Nello Cristianini,et al.  Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis , 2002, NIPS.

[17]  Susan T. Dumais,et al.  Automatic cross-linguistic information retrieval using latent semantic indexing , 2007 .

[18]  Manfred K. Warmuth,et al.  Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension , 2006, NIPS.

[19]  Hal Daumé,et al.  Kernelized Sorting for Natural Language Processing , 2010, AAAI.

[20]  John C. Platt,et al.  Translingual Document Representations from Discriminative Projections , 2010, EMNLP.

[21]  Reinhard Rapp,et al.  Automatic Identification of Word Translations from Unrelated English and German Corpora , 1999, ACL.

[22]  Wei Gao,et al.  Exploiting Bilingual Information to Improve Web Search , 2009, ACL.

[23]  H. Hotelling Relations Between Two Sets of Variates , 1936 .