Comparing representations in Chinese information retrieval

Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character) indexing, bigram (two contiguous, overlapping characters) indexing, and short-word indexing based on a simple segmentation of the text. The test collection is the TREC-5 Chinese corpus of news articles, approximately 170 MB, together with 28 long, richly worded queries. Evaluation shows that 1-gram indexing performs reasonably well but is not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing produces a large index term space, three times that of short-word indexing, but matches short-word indexing in precision and retrieves about 5% more relevant documents. The best average non-interpolated precision is about 0.45, which is 17% better than 1-gram indexing and quite high for a mainly statistical approach.
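
As a rough illustration of the 1-gram and bigram representations and of the evaluation measure named above, the following Python sketch extracts overlapping character n-grams from a string and computes non-interpolated average precision for one ranked list. The function names, the sample sentence, and the document ids are hypothetical and are not taken from the study. The sketch also simplifies in ways the abstract does not specify, for example by dropping only whitespace and not breaking at punctuation, and it omits short-word indexing, which would need a segmenter and lexicon.

```python
from typing import List, Sequence, Set


def char_ngrams(text: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of `text`, skipping whitespace.

    n=1 gives single-character (1-gram) index terms,
    n=2 gives contiguous overlapping bigram index terms.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for a single query.

    Precision is taken at the rank of each relevant document retrieved,
    and the sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical six-character example sentence.
sentence = "信息检索系统"
unigrams = char_ngrams(sentence, 1)  # six single-character terms
bigrams = char_ngrams(sentence, 2)   # five overlapping two-character terms

# Hypothetical ranking: relevant documents appear at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

A string of k characters yields k 1-gram occurrences and k - 1 bigram occurrences, so the number of postings is similar. The larger index term space reported for bigrams comes from the much larger number of distinct two-character combinations.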
