Learning bilingual translations from comparable corpora to cross-language information retrieval: hybrid statistics-based and linguistics-based approach

Recent years have seen increased interest in the construction and use of large corpora, and with it an expansion of their application to knowledge acquisition and bilingual terminology extraction. This paper presents an approach to bilingual lexicon extraction from non-aligned comparable corpora, combined with linguistics-based pruning, and evaluates it on Cross-Language Information Retrieval. We propose and explore a two-stage translation model for acquiring bilingual terminology from comparable corpora, followed by disambiguation and selection of the best translation alternatives on the basis of their morphological knowledge. Evaluations on a large-scale Japanese-English test collection, using different weighting schemes of the SMART retrieval system, confirmed the effectiveness of the proposed combination of the two-stage comparable corpora model and linguistics-based pruning on Cross-Language Information Retrieval.
