A Word-Finding Automaton for Chinese Sentence Tokenization

The word is the smallest meaningful unit in Chinese text. Thus, in any Chinese natural language processing system, sentence tokenization must be performed first. Previous methods for sentence tokenization include the statistical approach, the string-matching approach, and hybrids of the two. Most of these methods select only the best tokenization candidate as output, yet in some cases more than one output is possible. Determining all such candidates with existing methods requires at least three scans of the input string and a number of back-trackings. This paper proposes a word-finding automaton that outputs every tokenization candidate for a string, when more than one exists, with a single scan of the input and no back-tracking. Our algorithm reduces segmentation time by 30 percent with only a slight increase in the in-memory dictionary size. It can also be used to tokenize other Asian languages such as Japanese and Korean.
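To illustrate the underlying idea of enumerating all tokenization candidates against a word dictionary, the sketch below builds a trie and lists every segmentation of a string via memoized recursion. It is a minimal illustration only, not the paper's single-scan automaton; the DICTIONARY contents, the build_trie helper, and the segment function are all hypothetical names introduced for this example.

```python
from functools import lru_cache

# Hypothetical toy dictionary for illustration only.
DICTIONARY = ["中国", "中国人", "人", "民", "人民", "银行", "中国人民银行"]

def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def all_segmentations(sentence, trie):
    """Enumerate every way to split `sentence` into dictionary words.

    Word matches starting at each position are found by walking the trie,
    and partial results are combined with memoized recursion, so each
    suffix of the sentence is segmented only once.
    """
    @lru_cache(maxsize=None)
    def segment(start):
        if start == len(sentence):
            return [[]]          # one candidate: the empty tail
        results = []
        node = trie
        for end in range(start, len(sentence)):
            ch = sentence[end]
            if ch not in node:
                break            # no longer a prefix of any dictionary word
            node = node[ch]
            if "$" in node:      # a dictionary word ends here
                word = sentence[start:end + 1]
                for tail in segment(end + 1):
                    results.append([word] + tail)
        return results

    return segment(0)

if __name__ == "__main__":
    trie = build_trie(DICTIONARY)
    for candidate in all_segmentations("中国人民银行", trie):
        print(" / ".join(candidate))
```

Running the example prints several candidates (e.g. "中国 / 人民 / 银行" and "中国人民银行"), showing why a tokenizer may need to report more than one segmentation rather than a single best guess.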