Paper information - Comparing representations in Chinese information retrieval

Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous, overlapping characters), and short-word indexing based on a simple segmentation of the text. The test collection is the TREC-5 Chinese corpus of news articles, approximately 170 MB, with 28 long, richly worded queries. Evaluation shows that 1-gram indexing performs well but is not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing produces a large index term space, three times that of short-word indexing, yet it matches short-word indexing in precision and retrieves about 5% more relevant documents. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach.
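To make the first two representations concrete, the following is a minimal sketch of how 1-gram and overlapping bigram index terms could be generated from a string of Chinese characters. The function names and the sample string are illustrative assumptions, not taken from the paper, and the sketch ignores punctuation, non-Chinese text, and the short-word segmentation step, whose details the abstract does not give.

```python
def unigrams(text):
    """Return each character as its own index term (1-gram indexing)."""
    return list(text)


def bigrams(text):
    """Return every pair of adjacent characters; consecutive pairs overlap by one."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical sample string ("information retrieval"), chosen for illustration only.
sample = "信息检索"

print(unigrams(sample))  # ['信', '息', '检', '索']
print(bigrams(sample))   # ['信息', '息检', '检索']
```

In this sketch, the overlapping bigrams include a pair that straddles a word boundary ("息检"). Because any two adjacent characters form a term, the bigram vocabulary can grow much larger than a word-based one, which fits the larger index term space reported above.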

Kui-Lam Kwok
