Using Corpus-Based Approaches in a System for Multilingual Information Retrieval

We present a system for multilingual information retrieval that lets users formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on document-level alignment: documents in different languages are paired according to their similarity. The resulting mapping yields a multilingual comparable corpus, which has several applications. First, it allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Second, we perform pseudo-relevance feedback on the alignments to improve our retrieval results. Finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and to new languages, and performed very well in the CLIR system comparison at the TREC-7 conference.
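
The document-level alignment step can be illustrated with a minimal sketch. The abstract does not say how similarity is computed, so everything below is an assumption made for illustration: the cosine measure over token counts, the use of tokens that survive across languages (names, numbers), the `threshold` parameter, and the function names `cosine` and `align` are all hypothetical and are not taken from the system itself.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(docs_a, docs_b, threshold=0.3):
    """Pair each document in docs_a with its most similar document in docs_b.

    docs_a, docs_b: dicts mapping a document id to a list of tokens.
    Assumption: tokens are language-independent indicators such as
    proper names and numbers. Pairs scoring at or below `threshold`
    (a hypothetical cutoff) are dropped.
    """
    vectors_b = {doc_id: Counter(tokens) for doc_id, tokens in docs_b.items()}
    pairs = {}
    for id_a, tokens in docs_a.items():
        vec_a = Counter(tokens)
        best_id, best_score = None, threshold
        for id_b, vec_b in vectors_b.items():
            score = cosine(vec_a, vec_b)
            if score > best_score:
                best_id, best_score = id_b, score
        if best_id is not None:
            pairs[id_a] = best_id
    return pairs


# Toy collections; ids and tokens are made up for this example.
german = {
    "de1": ["clinton", "1998", "washington"],
    "de2": ["zürich", "1997", "uefa"],
}
french = {
    "fr1": ["uefa", "zürich", "1997"],
    "fr2": ["clinton", "washington", "1998"],
    "fr3": ["paris"],
}
alignments = align(german, french)
```

In this toy setting each German document is paired with the French document that shares its indicator tokens, and `fr3` stays unaligned. The resulting pairs are the kind of mapping from which a comparable corpus could be assembled.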
