Querying across languages: a dictionary-based approach to multilingual information retrieval

The multilingual information retrieval system of the future will need to be able to retrieve documents across language boundaries. This extension of the classical IR problem is particularly challenging, as significant resources are required to perform query translation. At Xerox, we are working to build a multilingual IR system and conducting a series of experiments to understand what factors are most important in making the system work. Using translated queries and a bilingual transfer dictionary, we have learned that cross-language multilingual IR is feasible, although performance lags considerably behind the monolingual standard. The experiments suggest that failure to correctly identify and translate multi-word terminology is the single most important source of error in the system, although ambiguity in translation also contributes to poor performance.
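
The two error sources named above can be illustrated with a minimal sketch of dictionary-based query translation. This is not the system described in the paper: the toy French-to-English entries, the names `WORD_DICT`, `PHRASE_DICT`, and `translate_query`, and the greedy longest-match phrase lookup are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical toy French-to-English transfer dictionary (illustrative only).
# Several entries are ambiguous: one source word maps to multiple translations.
WORD_DICT: Dict[str, List[str]] = {
    "pomme": ["apple"],
    "de": ["of", "from"],
    "terre": ["earth", "land", "ground"],
    "prix": ["price", "prize"],
}

# Hypothetical multi-word entries, consulted before single words.
PHRASE_DICT: Dict[str, List[str]] = {
    "pomme de terre": ["potato"],
}


def translate_query(query: str, use_phrases: bool) -> List[str]:
    """Translate a query term by term, keeping every dictionary alternative."""
    tokens = query.lower().split()
    translated: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        if use_phrases:
            # Greedy longest match: try the longest remaining span first,
            # down to spans of two tokens.
            for n in range(len(tokens) - i, 1, -1):
                phrase = " ".join(tokens[i:i + n])
                if phrase in PHRASE_DICT:
                    translated.extend(PHRASE_DICT[phrase])
                    i += n
                    matched = True
                    break
        if not matched:
            # Words missing from the dictionary pass through untranslated.
            translated.extend(WORD_DICT.get(tokens[i], [tokens[i]]))
            i += 1
    return translated


if __name__ == "__main__":
    query = "prix pomme de terre"
    print(translate_query(query, use_phrases=False))
    print(translate_query(query, use_phrases=True))
```

In this toy setting, word-by-word translation turns the term "pomme de terre" into a bag of unrelated words, while the phrase lookup recovers a single translation. The ambiguous word "prix" still yields two candidates in both cases, which corresponds to the second, smaller source of error.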