Chinese Word Segmentation as Character Tagging

In this paper we report results of a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to assign to each Chinese character, or hanzi, a tag that indicates the position of that hanzi within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach is competitive with other supervised machine-learning segmenters reported in previous studies: trained on a 237K-word training set, it achieves precision and recall of 95.01% and 94.94%, respectively.
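The conversion between segmented text and per-character position tags, and the word-level scoring of the result, can be illustrated with a minimal Python sketch. The tag names used here (B, M, E, S for word-initial, word-medial, word-final, and single-character word) are an assumption made for illustration and are not necessarily the tag set used in the paper; the function names are likewise hypothetical. The maximum entropy tagger itself is not shown: the sketch only covers the steps before training (deriving tags from annotated words) and after tagging (rebuilding words and computing precision and recall).

```python
from typing import List, Set, Tuple

# Assumed tag inventory (illustrative, not taken from the paper):
#   B = first hanzi of a multi-character word
#   M = middle hanzi of a multi-character word
#   E = last hanzi of a multi-character word
#   S = hanzi that forms a word on its own

def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (hanzi, position tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs

def tags_to_words(pairs: List[Tuple[str, str]]) -> List[str]:
    """Rebuild segmented text from (hanzi, tag) pairs.

    A word boundary is placed before every B or S and after every E or S,
    so ill-formed tag sequences still yield some segmentation.
    """
    words: List[str] = []
    current = ""
    for ch, tag in pairs:
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word by its (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def precision_recall(gold: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall over matching character spans."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall

gold = ["我", "喜欢", "自然语言"]
tagged = words_to_tags(gold)
restored = tags_to_words(tagged)
```

Under these assumptions, a predicted word counts as correct only when both of its boundaries match a word in the gold segmentation, which is one common way to compute segmentation precision and recall.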
