A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction

Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes using a set of general morphological rules to broaden coverage, while appending different linguistic and statistical constraints to the rules to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed, which merges possible morphemes recursively by consulting the general rules above and dynamically decides which rule should be applied first according to the priorities of the rules. The effects of different priority strategies are compared in our experiment, and the experimental results show that the performance of the proposed method is very promising.
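
The merging procedure described above can be illustrated with a minimal sketch. The abstract does not specify the actual morphological rules, constraints, or priority measures, so everything named here (`Rule`, `bottom_up_merge`, the toy `pair_count` table, and the frequency-based priority) is a hypothetical stand-in chosen only to show the control flow: at each step, every rule is checked against every adjacent pair of units, the single highest-priority application is carried out, and the process repeats on the merged sequence until no rule applies.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    """A hypothetical merging rule; not the paper's actual rule format."""
    name: str
    matches: Callable[[str, str], bool]    # may this adjacent pair be merged?
    priority: Callable[[str, str], float]  # higher values are applied first


def bottom_up_merge(tokens: List[str], rules: List[Rule]) -> List[str]:
    """Repeatedly merge the adjacent pair with the highest-priority rule match."""
    tokens = list(tokens)  # work on a copy
    while True:
        best: Optional[Tuple[float, int]] = None  # (priority, left index)
        for i in range(len(tokens) - 1):
            left, right = tokens[i], tokens[i + 1]
            for rule in rules:
                if rule.matches(left, right):
                    score = rule.priority(left, right)
                    # Strict '>' means the leftmost candidate wins ties.
                    if best is None or score > best[0]:
                        best = (score, i)
        if best is None:
            return tokens  # no rule applies anywhere: merging is finished
        _, i = best
        # Replace the pair with the merged unit, which can itself be merged
        # again on a later pass (the "recursive" part of the procedure).
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Toy data (assumed for illustration): co-occurrence counts of adjacent units.
pair_count = {("A", "B"): 3, ("B", "C"): 5, ("BC", "D"): 2}

# One toy rule: merge a pair seen at least twice, preferring frequent pairs.
frequent_pair = Rule(
    name="frequent-pair",
    matches=lambda l, r: pair_count.get((l, r), 0) >= 2,
    priority=lambda l, r: float(pair_count.get((l, r), 0)),
)

result = bottom_up_merge(["A", "B", "C", "D"], [frequent_pair])
print(result)  # ['A', 'BCD']
```

In this toy run, "B" could merge with either neighbor; the priority picks "BC" over "AB", and the merged unit "BC" then combines with "D" on the next pass. Swapping in a different `priority` function is how alternative priority strategies could be compared in such a sketch.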
