Multilingual Information Retrieval with Asian Languages

Interest in Chinese, Japanese and Korean on the Web has been increasing. The first objective of this paper is to compare the retrieval performance of nine vector-space and two probabilistic models when carrying out monolingual searches in these three Asian languages. Our second goal, based on the NTCIR-3 test collection, is to analyze the relative merit of various automated tools for translating English-language topics into Chinese, Japanese or Korean, and then searching documents written in these languages. Moreover, we show how bilingual searches can be improved by using both a combined translation strategy and a data fusion approach. Finally, we address the underlying problems of multilingual searches in which an English topic is used to search documents written in English, Chinese and Japanese.
