Extracting Chinese Multi-Word Units from Large-Scale Balanced Corpus

Automatic extraction of multi-word units is an important problem in natural language processing. This paper proposes a new statistical method for extracting multi-word units from a large-scale balanced corpus. The method uses improved versions of two traditional measures, mutual information and the log-likelihood ratio, and raises the precision of the top 10,000 extracted units to 80.13%. The results indicate that the method is more efficient and robust than previous multi-word unit extraction methods.
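
For readers unfamiliar with the two measures named above, the following is a minimal Python sketch of their standard textbook forms for adjacent word pairs: pointwise mutual information and the log-likelihood ratio computed from a 2x2 contingency table. It does not reproduce the improved variants used in this paper, which the abstract does not specify. The function name `bigram_scores` and the assumption that the input is an already segmented list of tokens are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, llr)} for every adjacent token pair.

    Standard formulations only; NOT the paper's improved measures.
    """
    n = len(tokens) - 1  # total number of adjacent pairs
    pair_counts = Counter(zip(tokens, tokens[1:]))
    left_counts = Counter(tokens[:-1])   # times a word starts a pair
    right_counts = Counter(tokens[1:])   # times a word ends a pair

    scores = {}
    for (w1, w2), k11 in pair_counts.items():
        c1 = left_counts[w1]
        c2 = right_counts[w2]

        # Pointwise mutual information: log2 of P(w1, w2) / (P(w1) * P(w2)).
        pmi = math.log2(k11 * n / (c1 * c2))

        # Observed cells of the 2x2 contingency table.
        k12 = c1 - k11            # w1 followed by something other than w2
        k21 = c2 - k11            # something other than w1 followed by w2
        k22 = n - c1 - c2 + k11   # neither w1 first nor w2 second

        # Log-likelihood ratio: 2 * sum(observed * ln(observed / expected)),
        # where expected counts assume w1 and w2 occur independently.
        cells = (
            (k11, c1 * c2 / n),
            (k12, c1 * (n - c2) / n),
            (k21, (n - c1) * c2 / n),
            (k22, (n - c1) * (n - c2) / n),
        )
        llr = 2 * sum(obs * math.log(obs / exp) for obs, exp in cells if obs > 0)

        scores[(w1, w2)] = (pmi, llr)
    return scores
```

Candidate units would then be ranked by one or both scores and the top of the ranking inspected. How the scores are combined and thresholded is left open here, because the abstract does not describe it.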
