Combining Lexical Context with Pseudo-alignment for Bilingual Lexicon Extraction from Comparable Corpora

Only a few studies have made use of alignment information in bilingual lexicon extraction from comparable corpora, and those that do require the corpora to be divided into 1-1 aligned document pairs. They have not been able to show that extracted lexicons benefit from the embedding of alignment information. Moreover, strict 1-1 alignments are not broadly available in comparable corpora. In this paper, we develop a language-independent approach to lexicon extraction that combines the classic lexical context with pseudo-alignment information. Experiments on an English-French comparable corpus demonstrate that pseudo-alignment in comparable corpora is an essential feature leading to a significant improvement over the standard method of lexicon extraction, a perspective that has never been investigated in a similar way by previous studies.
