Phrasal Translation for English-Chinese Cross Language Information Retrieval

This paper introduces a simple and effective nonoverlapping unigram and bigram segmentation method for both monolingual Chinese and English-Chinese cross-language retrieval. It also describes English-Chinese cross-language retrieval experiments involving 54 topics and some 164,000 documents. English queries are translated into Chinese using a Chinese-English dictionary of about 120,000 entries. A technique for extracting noun phrases is presented and applied prior to query translation. Phrasal translation outperformed word translation by 23.6%, even though most of the noun phrases extracted from the queries were not translated as phrases because of the limited coverage of the bilingual dictionary. Cross-language retrieval achieved about 53% of the effectiveness of monolingual retrieval, which suggests that there is a lot of room for improvement. The two main limiting factors in English-Chinese retrieval performance are the limited coverage of the bilingual dictionary and the existence of multiple Chinese translation equivalents for many
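As a rough illustration of the phrase-first translation idea described above, the following Python sketch looks up an extracted noun phrase as a whole in a bilingual dictionary and falls back to word-by-word translation when the phrase is not covered. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `TOY_DICT`, `translate_phrase`, and `translate_query` are hypothetical, the dictionary entries are invented for the example, and keeping every translation equivalent of a word is only one possible way to handle ambiguity.

```python
from typing import Dict, List

# Toy English-to-Chinese dictionary. These entries are hypothetical examples,
# not taken from the 120,000-entry dictionary used in the paper.
TOY_DICT: Dict[str, List[str]] = {
    "information retrieval": ["信息检索"],
    "information": ["信息", "资料"],
    "retrieval": ["检索"],
    "system": ["系统", "体系", "制度"],
}


def translate_phrase(phrase: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Translate a noun phrase as a unit if possible, else word by word."""
    key = phrase.lower().strip()
    # Phrasal translation: the whole phrase has a dictionary entry.
    if key in dictionary:
        return list(dictionary[key])
    # Fallback (assumed here): translate each word separately and keep all
    # of its translation equivalents; words missing from the dictionary
    # contribute nothing.
    translations: List[str] = []
    for word in key.split():
        translations.extend(dictionary.get(word, []))
    return translations


def translate_query(phrases: List[str], dictionary: Dict[str, List[str]]) -> List[str]:
    """Translate every extracted phrase of a query and concatenate the results."""
    terms: List[str] = []
    for phrase in phrases:
        terms.extend(translate_phrase(phrase, dictionary))
    return terms
```

In this toy setting, "information retrieval" is translated as a single phrase, whereas "retrieval system" has no phrase entry and falls back to its individual words, at which point the three equivalents listed for "system" all enter the translated query. This mirrors the two limiting factors named in the abstract: dictionary coverage and multiple translation equivalents.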
