Paper Information - Chinese Word Segmentation as Character Tagging

This paper reports results for a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to assign each Chinese character, or hanzi, a tag that indicates its position within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach is competitive with other supervised machine-learning segmenters reported in previous studies: trained on a 237K-word training set, it achieves precision and recall of 95.01% and 94.94%, respectively.
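
The abstract describes three steps: tagging each hanzi with its position in a word, converting the tagged output back into segmented text, and scoring the result with precision and recall. The Python sketch below illustrates only the conversion and scoring steps; it does not include the maximum entropy tagger. The four-tag scheme (`B`, `M`, `E`, `S`), the function names, and the span-based scoring are assumptions made for illustration and are not taken from the abstract, so the paper's own tag set and evaluation procedure may differ.

```python
from typing import List, Set, Tuple

# Hypothetical tag set (an assumption, not taken from the abstract):
# B = first hanzi of a multi-hanzi word, M = middle hanzi,
# E = last hanzi, S = single-hanzi word.


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (hanzi, position tag) pairs."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
        else:
            tagged.append((word[0], "B"))
            tagged.extend((ch, "M") for ch in word[1:-1])
            tagged.append((word[-1], "E"))
    return tagged


def tags_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Convert tagged hanzi back into a list of words.

    A tagger can emit inconsistent sequences (for example B followed
    by S), so a new word is also started whenever B or S appears.
    """
    words: List[str] = []
    current = ""
    for ch, tag in tagged:
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word by its (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall(predicted: List[str], gold: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall: a word counts as correct
    only if both of its boundaries match the gold segmentation."""
    pred_spans = word_spans(predicted)
    gold_spans = word_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


gold = ["上海", "计划", "发展"]
predicted = ["上海", "计", "划", "发展"]
print(tags_to_words(words_to_tags(gold)))
print(precision_recall(predicted, gold))
```

Comparing words by character offsets rather than by string means that a repeated word is only credited when it occurs at the right position in the sentence.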

Nianwen Xue
