Paper Information - Extracting Chinese Multi-Word Units from Large-Scale Balanced Corpus

Automatic extraction of multi-word units is an important problem in natural language processing. This paper proposes a new statistical method, based on a large-scale balanced corpus, for extracting multi-word units. We use improved versions of two traditional measures, mutual information and log-likelihood ratio, and raise the precision of the top 10,000 extracted units to 80.13%. The results indicate that this method is more efficient and robust than previous multi-word unit extraction methods.
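The abstract names two association measures without defining them. As a rough illustration only, the sketch below scores adjacent word pairs with the textbook forms of pointwise mutual information and the log-likelihood ratio (the G² statistic over a 2x2 contingency table). It does not reproduce the paper's improved variants, which the abstract does not specify. The function names `pmi`, `llr`, and `score_bigrams` are hypothetical, and the input is assumed to be an already segmented token list.

```python
import math
from collections import Counter


def pmi(c12, c1, c2, n):
    """Pointwise mutual information (base 2) of a bigram.

    c12: count of the bigram (x, y)
    c1:  count of bigrams whose first word is x
    c2:  count of bigrams whose second word is y
    n:   total number of bigrams
    """
    return math.log2(c12 * n / (c1 * c2))


def _term(observed, expected):
    # Contribution O * ln(O / E); an empty cell contributes nothing.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def llr(c12, c1, c2, n):
    """Log-likelihood ratio (G^2) over the 2x2 table for (x, y)."""
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    return 2.0 * sum(_term(o, e) for o, e in zip(observed, expected))


def score_bigrams(tokens):
    """Return {(x, y): (pmi, llr)} for every adjacent pair in tokens."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first = Counter()   # how often each word starts a bigram
    second = Counter()  # how often each word ends a bigram
    for (x, y), count in bigrams.items():
        first[x] += count
        second[y] += count
    return {
        (x, y): (
            pmi(count, first[x], second[y], n),
            llr(count, first[x], second[y], n),
        )
        for (x, y), count in bigrams.items()
    }


if __name__ == "__main__":
    # Toy input; a real run would use a segmented corpus.
    sample = ["new", "york", "is", "big", "new", "york", "is", "far"]
    ranked = sorted(score_bigrams(sample).items(), key=lambda kv: -kv[1][1])
    for pair, (mi, g2) in ranked:
        print(pair, round(mi, 3), round(g2, 3))
```

Ranking candidates by either score and keeping the top of the list is one common way such measures are used. Pointwise mutual information tends to favor rare pairs, while the log-likelihood ratio is less sensitive to low counts, which is one reason the two are often combined.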

Xiaohua Liu | Tingting He | Jianzhou Liu
