A Language Modeling Approach for Extracting Translation Knowledge from Comparable Corpora

A main challenge in cross-language information retrieval (CLIR) is estimating the translation language model, since its quality directly affects retrieval performance. The translation language model is built from translation resources such as bilingual dictionaries, parallel corpora, or comparable corpora. High-quality resources are often unavailable for resource-scarce languages, so for these languages it is all the more important to make effective use of commonly available resources such as comparable corpora. In this paper, we focus on extracting translation information more effectively using comparable corpora alone. We propose a language modeling approach for estimating the translation language model. The proposed method is based on probability distribution estimation and can be tuned more easily than previous, heuristically adjusted methods. Experimental results show a significant improvement in translation quality and CLIR performance over previous approaches.
