Translingual Information Retrieval: Learning from Bilingual Corpora

Abstract Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms dictionarybased methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) perform well, as does translingual pseudo-relevance feedback (PRF) and Example-Based Term-in-context translation (EBT). All showed relatively small performance loss between monolingual and translingual versions, ranging between 87–101% of monolingual IR performance. Query translation based on a general machine-readable bilingual dictionary—heretofore the most popular method—did not match the performance of other, more sophisticated methods. Also, the previous very high LSI results in the literature based on “mate-finding” were superseded by more realistic relevance-based evaluations; LSI performance proved comparable to that of other statistical corpus-based methods.

[1]  Gregory Grefenstette,et al.  Querying across languages: a dictionary-based approach to multilingual information retrieval , 1996, SIGIR '96.

[2]  Ralf D. Brown,et al.  Example-Based Machine Translation in the Pangloss System , 1996, COLING.

[3]  Mark W. Davis,et al.  A TREC Evaluation of Query Translation Methods For Multi-Lingual Text Retrieval , 1995, TREC.

[4]  P. C. Wong,et al.  Generalized vector spaces model in information retrieval , 1985, SIGIR '85.

[5]  Yiming Yang,et al.  Noise reduction in a statistical approach to text categorization , 1995, SIGIR '95.

[6]  David Graff,et al.  Multilingual Text Resources at the Linguistic Data Consortium , 1994, HLT.

[7]  D. K. Harmon,et al.  Overview of the Third Text Retrieval Conference (TREC-3) , 1996 .

[8]  W. Bruce Croft,et al.  Phrasal translation and query expansion techniques for cross-language information retrieval , 1997, SIGIR '97.

[9]  Ellen M. Voorhees,et al.  The fifth text REtrieval conference (TREC-5) , 1997 .

[10]  Peter Schäuble,et al.  Cross-language speech retrieval: establishing a baseline performance , 1997, SIGIR '97.

[11]  Gerard Salton,et al.  Automatic Processing of Foreign Language Documents , 1969, COLING.

[12]  Yiming Yang,et al.  Translingual Information Retrieval: A Comparative Evaluation , 1997, IJCAI.

[13]  A Elithorn,et al.  ARTIFICIAL AND HUMAN INTELLIGENCE , 1984 .

[14]  Sergei Nirenburg,et al.  Integrating Translations from Multiple Sources within the PANGLOSS Mark III Machine Translation System , 1994, AMTA.

[15]  Gerard Salton,et al.  Improving retrieval performance by relevance feedback , 1997, J. Am. Soc. Inf. Sci..

[16]  Jaime G. Carbonell,et al.  New Approaches to Machine Translation , 1985, TMI.

[17]  Makoto Nagao,et al.  A framework of a mechanical translation between Japanese and English by analogy principle , 1984 .

[18]  Jean Paul Ballerini,et al.  Experiments in multilingual information retrieval using the SPIDER system , 1996, SIGIR '96.

[19]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[20]  Mark W. Davis,et al.  QUILT: implementing a large-scale cross-language text retrieval system , 1997, SIGIR '97.

[21]  Donna K. Harman,et al.  Overview of the Third Text REtrieval Conference (TREC-3) , 1995, TREC.

[22]  Padmini Srinivasan,et al.  Optimal Document-Indexing Vocabulary for MEDLINE , 1996, Inf. Process. Manag..

[23]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[24]  Chris Buckley,et al.  OHSUMED: an interactive retrieval evaluation and new large test collection for research , 1994, SIGIR '94.

[25]  Vijay V. Raghavan,et al.  On modeling of information retrieval concepts in vector spaces , 1987, TODS.

[26]  Ralf D. Brown Automatically-Extracted Thesauri for Cross-Language IR: When Better is Worse , 1998 .

[27]  James Allan,et al.  Automatic Query Expansion Using SMART: TREC 3 , 1994, TREC.