Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge

We extend prior work on using latent cross-language topic models to identify word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on the per-topic word distributions obtained from the bilingual LDA (BiLDA) latent topic model. The algorithm greedily harvests only the most probable word translations across languages, without any prior knowledge about the language pair, relying on a symmetrization process and a one-to-one constraint. We report results for the Italian-English and Dutch-English language pairs that outperform the current state of the art by a significant margin. In addition, we show how the algorithm can be used to construct high-quality initial seed lexicons of translations.
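The abstract names three ingredients: greedy harvesting, symmetrization, and a one-to-one constraint. The sketch below is one possible reading of how they could fit together, not the authors' algorithm. It assumes each word is already represented by a vector of per-topic scores (as a trained BiLDA model might provide) and uses cosine similarity as a stand-in similarity measure. The function name `harvest_translations`, the `threshold` parameter, and the toy vectors are all hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def harvest_translations(src_topics, tgt_topics, threshold=0.9):
    """Greedily collect confident one-to-one translation pairs.

    src_topics / tgt_topics map a word to its per-topic score vector.
    Cosine similarity and the threshold are assumptions of this sketch.
    """
    src, tgt = dict(src_topics), dict(tgt_topics)
    lexicon = []
    while src and tgt:
        candidates = []
        for s, s_vec in src.items():
            # Best target-language candidate for the source word.
            t_best = max(tgt, key=lambda t: cosine(s_vec, tgt[t]))
            # Symmetrization: the target word must pick s back.
            s_back = max(src, key=lambda x: cosine(src[x], tgt[t_best]))
            score = cosine(s_vec, tgt[t_best])
            if s_back == s and score >= threshold:
                candidates.append((score, s, t_best))
        if not candidates:
            break
        # Greedy step: accept only the single most confident pair.
        score, s, t = max(candidates)
        lexicon.append((s, t, score))
        # One-to-one constraint: matched words leave both vocabularies.
        del src[s]
        del tgt[t]
    return lexicon


# Toy per-topic scores over three topics (made-up numbers).
SRC = {
    "cane": [0.90, 0.05, 0.05],
    "gatto": [0.10, 0.85, 0.05],
    "casa": [0.30, 0.30, 0.40],
}
TGT = {
    "dog": [0.88, 0.06, 0.06],
    "cat": [0.08, 0.88, 0.04],
    "thing": [0.40, 0.35, 0.25],
}

if __name__ == "__main__":
    for s, t, score in harvest_translations(SRC, TGT, threshold=0.99):
        print(f"{s} -> {t} ({score:.4f})")
```

With a strict threshold, the loosely matching pair is left out of the lexicon rather than guessed at, which is the precision-over-recall trade-off the abstract emphasizes. The accepted pairs could then serve as a seed lexicon for a later, recall-oriented method.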
