Experiments on Unsupervised Chinese Word Segmentation and Classification

Chinese is written without word delimiters, which raises several problems for Chinese language processing, and the difficulty of defining what counts as a word compounds them. This paper explores the possibility of automatically segmenting Chinese character sequences into words and classifying those words through distributional analysis, in contrast with the usual approaches that depend on dictionaries.
