Automatic thesaurus generation for Chinese documents

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Only 8.3% of these unregistered words are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar level of term relatedness.
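
The accumulation step described above can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the Dice coefficient over sentence-level co-occurrence used here is an assumption, as are the function names, the input layout (each document as a list of sentences, each sentence as a list of already extracted keywords), and the threshold value.

```python
from collections import defaultdict
from itertools import combinations


def doc_pair_weights(sentences):
    """Assumed weighting: Dice coefficient of sentence-level
    co-occurrence for each keyword pair within one document."""
    freq = defaultdict(int)  # number of sentences containing a term
    co = defaultdict(int)    # number of sentences containing both terms
    for sent in sentences:
        terms = sorted(set(sent))
        for t in terms:
            freq[t] += 1
        for a, b in combinations(terms, 2):
            co[(a, b)] += 1
    return {
        pair: 2.0 * n / (freq[pair[0]] + freq[pair[1]])
        for pair, n in co.items()
    }


def accumulate_similarities(docs, threshold=0.5):
    """Sum per-document weights that exceed the threshold
    to obtain collection-wide term pair similarities."""
    sims = defaultdict(float)
    for sentences in docs:
        for pair, w in doc_pair_weights(sentences).items():
            if w > threshold:  # hypothetical cutoff, not from the article
                sims[pair] += w
    return dict(sims)


docs = [
    [["thesaurus", "keyword"], ["thesaurus", "keyword", "lexicon"]],
    [["thesaurus", "keyword"], ["lexicon"]],
]
sims = accumulate_similarities(docs, threshold=0.7)
```

Because weights are computed one document at a time and only those above the threshold are kept, the pair table stays small, which is one plausible reading of why such a scheme would be fast; the article itself should be consulted for the actual weighting and filtering rules.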
