A Stochastic Finite-State Word-Segmentation Algorithm for Chinese

The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "reconstruct" the word-boundary information. In this paper we present a stochastic finite-state model in which the basic workhorse is the weighted finite-state transducer. The model segments Chinese text into dictionary entries and words derived by various productive lexical processes. Since the primary intended application of the model is text-to-speech synthesis, it also provides pronunciations for these words. We evaluate the system's performance by comparing its segmentation "judgments" with the judgments of a pool of human segmenters, and the system is shown to perform quite well.
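
To make the idea of weighted segmentation concrete, the following is a minimal sketch and not the system described here. It assumes each dictionary word carries a cost equal to the negative log of its relative frequency, and it uses plain dynamic programming in place of a weighted finite-state transducer to find the lowest-cost path through the input. The lexicon, its counts, the function names, and the `unknown_cost` penalty are all hypothetical and chosen only for illustration; pronunciations and productively derived words are left out.

```python
import math

# Hypothetical toy lexicon: word -> count (made-up numbers, illustration only).
TOY_COUNTS = {
    "日": 5,
    "文": 5,
    "章": 2,
    "魚": 5,
    "日文": 20,
    "文章": 20,
    "章魚": 20,
}


def word_costs(counts):
    """Turn counts into costs: cost(w) = -log(count(w) / total count)."""
    total = sum(counts.values())
    return {w: -math.log(c / total) for w, c in counts.items()}


def segment(text, costs, unknown_cost=20.0):
    """Return (words, total_cost) for the lowest-cost segmentation of text.

    Any single character missing from the lexicon is accepted as a
    one-character word at the (assumed) fixed penalty unknown_cost.
    """
    max_len = max(len(w) for w in costs)
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in costs:
                cost = costs[piece]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return list(reversed(words)), best[n]


if __name__ == "__main__":
    words, cost = segment("日文章魚", word_costs(TOY_COUNTS))
    print(words, round(cost, 3))
```

With these toy counts, the two two-character words together cost less than any path through 文章, so the sketch chooses 日文 followed by 章魚. A transducer-based system expresses the same best-path search as composition of the input with a weighted dictionary, which also lets other weighted components be combined with it.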
