A delimiter-based general approach for Chinese term extraction

This article presents a two-step approach to term extraction. The first step, term candidate extraction, uses a new delimiter-based method that identifies features of the delimiters of term candidates rather than features of the candidates themselves. This delimiter-based method is much more stable and domain independent than previous approaches. The second step, term verification, applies a link analysis algorithm to calculate the relevance between term candidates and the sentences from which they are extracted. All information is obtained from the working domain corpus, with no need for prior domain knowledge. The approach does not target any specific domain and requires no extensive training when applied to a new domain, which makes it especially useful for resource-limited domains. Evaluations on Chinese text in two different domains show significant improvements over existing techniques and confirm the method's efficiency and relative domain independence. The method is also very effective at extracting new terms, so it can serve as an efficient tool for updating domain knowledge, especially for expanding lexicons. © 2010 Wiley Periodicals, Inc.
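The two steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how delimiters are identified or which link analysis algorithm is used, so the sketch assumes a hand-supplied delimiter set and a HITS-style mutual reinforcement between candidates and sentences. The function names `extract_candidates` and `rank_candidates`, the `iterations` parameter, and the sample delimiters are all hypothetical.

```python
import re


def extract_candidates(sentence, delimiters):
    """Step 1 (sketch): cut the sentence at every delimiter.

    The strings left between delimiters are treated as term candidates.
    The delimiter set is supplied by the caller (an assumption).
    """
    if not delimiters:
        return [sentence] if sentence else []
    # Try longer delimiters first so they are not shadowed by shorter ones.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return [piece for piece in re.split(pattern, sentence) if piece]


def rank_candidates(sentences, delimiters, iterations=20):
    """Step 2 (sketch): HITS-style scoring on a candidate-sentence graph.

    A candidate scores highly if it occurs in highly scored sentences,
    and a sentence scores highly if it contains highly scored candidates.
    This is an assumed stand-in for the paper's link analysis.
    """
    cand_sets = [set(extract_candidates(s, delimiters)) for s in sentences]
    cand_score = {c: 1.0 for cands in cand_sets for c in cands}
    sent_score = [1.0] * len(sentences)

    for _ in range(iterations):
        # Update candidate scores from the sentences that contain them.
        updated = {c: 0.0 for c in cand_score}
        for i, cands in enumerate(cand_sets):
            for c in cands:
                updated[c] += sent_score[i]
        norm = sum(v * v for v in updated.values()) ** 0.5 or 1.0
        cand_score = {c: v / norm for c, v in updated.items()}

        # Update sentence scores from the candidates they contain.
        raw = [sum(cand_score[c] for c in cands) for cands in cand_sets]
        norm = sum(v * v for v in raw) ** 0.5 or 1.0
        sent_score = [v / norm for v in raw]

    return sorted(cand_score.items(), key=lambda kv: kv[1], reverse=True)
```

With the hypothetical delimiter set `{"的", "是", "和", "。"}`, `extract_candidates("支持向量机是分类的方法。", ...)` yields the three candidates `支持向量机`, `分类`, and `方法`. Passing a list of such sentences to `rank_candidates` returns the candidates ordered by their reinforced scores.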
