A Framework for Cross-Language Information Access: Application to English and Japanese

Internet search engines allow access to online information from all over the world. However, there is currently a general assumption that users are fluent in the languages of all documents that they might search for. For historical reasons, this has usually meant a choice between English and the locally supported language. Given the rapidly growing size of the Internet, it is likely that future users will need to access information in languages in which they are not fluent or of which they have no knowledge at all. This paper shows how information retrieval and machine translation can be combined in a cross-language information access framework to help overcome the language barrier. We present encouraging preliminary experimental results using English queries to retrieve documents from the standard Japanese language BMIR-J2 retrieval test collection. We outline the scope and purpose of cross-language information access and provide an example application to suggest that technology already exists to provide effective and potentially useful applications.
