Mining comparable bilingual text corpora for cross-language information integration

Integrating information across multiple natural languages is a challenging task that often requires manually created linguistic resources, such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any such resources; instead, it exploits comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between those words. Such mappings can then be used to discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method relies only on naturally available comparable corpora, it is generally applicable to any language pair for which comparable corpora are available.
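
To make the frequency-correlation idea concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes each word is represented by its count per time slice (for example, per day) in its own corpus, and it uses Pearson correlation as the association measure; the abstract does not specify the measure, so that choice is an assumption. The function names `pearson` and `best_mappings` and the toy counts are hypothetical and serve illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant series carries no correlation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_mappings(src_series, tgt_series):
    """Map each source word to the target word whose frequency
    series correlates most strongly with its own."""
    mapping = {}
    for src_word, src_counts in src_series.items():
        scored = [
            (pearson(src_counts, tgt_counts), tgt_word)
            for tgt_word, tgt_counts in tgt_series.items()
        ]
        score, tgt_word = max(scored)
        mapping[src_word] = (tgt_word, score)
    return mapping


# Hypothetical per-day counts over the same six days in two news corpora.
english = {
    "earthquake": [0, 5, 9, 2, 0, 0],
    "election": [3, 1, 0, 0, 6, 8],
}
chinese = {
    "地震": [0, 4, 10, 3, 0, 0],
    "选举": [2, 1, 0, 1, 5, 9],
}

mapping = best_mappings(english, chinese)
```

In this toy data, words about the same event peak on the same days, so each English word pairs with the Chinese word that has a similar burst pattern. A word-level mapping of this kind could then be aggregated over the words of a document to score document pairs, which is the second step the abstract describes.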
