Automatic thesaurus for enhanced Chinese text retrieval

Asian languages such as Japanese, Korean and, in particular, Chinese are beginning to gain popularity in the information retrieval (IR) domain. The quality of an IR system has traditionally been judged by its retrieval effectiveness, which in turn is commonly measured by recall and precision. This paper proposes and describes a process for automatically generating a Chinese thesaurus that can supply terms related to a user's query in order to enhance retrieval effectiveness. In the absence of existing automatic Chinese thesauri, techniques used in English thesaurus generation were evaluated and adapted to generate a Chinese equivalent. The thesaurus is generated by computing co-occurrence values between domain-specific terms found in a document collection. These co-occurrence values are in turn derived from the term frequencies and document frequencies of the terms. A set of experiments was then carried out on a document test set to evaluate the applicability of the thesaurus. The results of these experiments confirmed that such an automatically generated thesaurus is able to improve the retrieval effectiveness of a Chinese IR system.
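To make the co-occurrence idea concrete, the following is a minimal sketch of how a thesaurus of this kind could be built and used for query expansion. The abstract does not state the exact weighting formula, so everything specific here is an assumption: the tf-idf-style weight, the asymmetric normalisation, and the names `build_thesaurus` and `expand_query` are illustrative and not taken from the paper. The sketch also assumes the documents have already been segmented into terms.

```python
import math
from collections import Counter, defaultdict


def build_thesaurus(docs, top_k=3):
    """Build a co-occurrence thesaurus from pre-segmented documents.

    docs: list of documents, each a list of terms (segmentation is assumed
    to have been done beforehand). Returns {term: [(related_term, score)]}.
    """
    n_docs = len(docs)
    tf = [Counter(doc) for doc in docs]                      # term frequencies
    df = Counter(term for counts in tf for term in counts)   # document frequencies

    def weight(i, term):
        # Assumed tf-idf-style weight; the +1 keeps terms that occur in
        # every document from collapsing to zero.
        return tf[i][term] * (math.log(n_docs / df[term]) + 1.0)

    total = defaultdict(float)   # summed weight of each term over all documents
    shared = defaultdict(float)  # summed shared weight of each ordered term pair
    for i, counts in enumerate(tf):
        terms = list(counts)
        for a in terms:
            wa = weight(i, a)
            total[a] += wa
            for b in terms:
                if a != b:
                    shared[(a, b)] += min(wa, weight(i, b))

    thesaurus = defaultdict(list)
    for (a, b), s in shared.items():
        # Asymmetric score: how strongly b accompanies a (an assumption).
        thesaurus[a].append((b, s / total[a]))
    for a in thesaurus:
        thesaurus[a].sort(key=lambda pair: (-pair[1], pair[0]))
        del thesaurus[a][top_k:]
    return dict(thesaurus)


def expand_query(query_terms, thesaurus, per_term=2):
    """Append the highest-scoring related terms to the original query."""
    expanded = list(query_terms)
    for term in query_terms:
        for related, _score in thesaurus.get(term, [])[:per_term]:
            if related not in expanded:
                expanded.append(related)
    return expanded


if __name__ == "__main__":
    # Toy collection, invented purely for illustration.
    docs = [["检索", "系统", "中文"], ["检索", "系统"], ["中文", "分词"]]
    thesaurus = build_thesaurus(docs)
    print(expand_query(["检索"], thesaurus, per_term=1))
```

In this toy collection, 系统 appears in every document that contains 检索, so it becomes the top related term for 检索 and is the one added to the query. A real system would additionally restrict the vocabulary to domain-specific terms, as the abstract describes.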
