Applications of multilingual text retrieval

The recent enormous increase in the use of networked information access and on-line databases has led to more databases being available in languages other than English. The Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts is involved in a variety of industrial, government, and digital library applications that need multilingual text retrieval. Most information retrieval research, however, has been evaluated using English databases and queries, and relatively little is known about how well advanced statistical techniques that incorporate ranking and term weighting perform in different languages. We describe our experience with a range of projects involving text retrieval in Spanish, Japanese, and Chinese. The issues covered by these projects include document representation techniques such as morphology and segmentation, query formulation and expansion techniques, relevance feedback, and comparisons of retrieval effectiveness with English databases. The results indicate that advanced statistical techniques are effective in a wide range of languages, and that new languages can be incorporated with only moderate effort.
