A Word-Finding Automaton for Chinese Sentence Tokenization

The word is the smallest meaningful unit in Chinese text. Thus, in any Chinese natural language processing system, sentence tokenization must be performed first. Previous methods for sentence tokenization include the statistical approach, the string-matching approach, and hybrids of the two. Most of these methods select only the best tokenization candidate as output, yet in some cases more than one output is possible. Determining all such candidates with existing methods requires at least three scans of the input string and a number of back-trackings. This paper proposes a word-finding automaton that outputs every tokenization candidate for a string, when more than one exists, with a single scan of the input and no back-tracking. Our algorithm reduces segmentation time by 30 percent with only a slight increase in the in-memory dictionary size. It can also be used to tokenize other Asian languages such as Japanese and Korean.
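To illustrate the underlying idea of enumerating all tokenization candidates against a word dictionary, the sketch below builds a trie and lists every segmentation of a string via memoized recursion. It is a minimal illustration only, not the paper's single-scan automaton; the DICTIONARY contents, the build_trie helper, and the segment function are all hypothetical names introduced for this example.

```python
from functools import lru_cache

# Hypothetical toy dictionary for illustration only.
DICTIONARY = ["中国", "中国人", "人", "民", "人民", "银行", "中国人民银行"]

def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def all_segmentations(sentence, trie):
    """Enumerate every way to split `sentence` into dictionary words.

    Word matches starting at each position are found by walking the trie,
    and partial results are combined with memoized recursion, so each
    suffix of the sentence is segmented only once.
    """
    @lru_cache(maxsize=None)
    def segment(start):
        if start == len(sentence):
            return [[]]          # one candidate: the empty tail
        results = []
        node = trie
        for end in range(start, len(sentence)):
            ch = sentence[end]
            if ch not in node:
                break            # no longer a prefix of any dictionary word
            node = node[ch]
            if "$" in node:      # a dictionary word ends here
                word = sentence[start:end + 1]
                for tail in segment(end + 1):
                    results.append([word] + tail)
        return results

    return segment(0)

if __name__ == "__main__":
    trie = build_trie(DICTIONARY)
    for candidate in all_segmentations("中国人民银行", trie):
        print(" / ".join(candidate))
```

Running the example prints several candidates (e.g. "中国 / 人民 / 银行" and "中国人民银行"), showing why a tokenizer may need to report more than one segmentation rather than a single best guess.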