Feature-based Thai Word Segmentation

Word segmentation is a problem in several Asian languages that have no explicit word boundary delimiter, e.g. Chinese, Japanese, Korean and Thai. We propose to use feature-based approaches for Thai word segmentation. A feature can be anything that tests for specific information in the context around the word in question, such as context words and collocations. To automatically extract such features from a training corpus, we employ two learning algorithms, namely RIPPER and Winnow. Experimental results show that both algorithms appear to outperform the existing Thai word segmentation methods, especially for context-dependent strings.
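
To make the two feature types named above concrete, the following is a minimal illustrative sketch and not the implementation evaluated here. It assumes a candidate segmentation is already available as a list of tokens. The function names, the window size, the maximum collocation length and the string encoding of features are all hypothetical choices made for this example.

```python
from typing import List, Set


def context_word_features(tokens: List[str], i: int, window: int = 2) -> Set[str]:
    """Context-word features: test whether a given word occurs anywhere
    within +/- `window` positions of the target token at index i.
    (The window size is an assumption for illustration.)"""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return {"ctx=" + tokens[j] for j in range(lo, hi) if j != i}


def collocation_features(tokens: List[str], i: int, max_len: int = 2) -> Set[str]:
    """Collocation features: test for a contiguous word sequence of up to
    `max_len` tokens immediately to the left or right of the target token.
    (The maximum length is an assumption for illustration.)"""
    feats: Set[str] = set()
    for n in range(1, max_len + 1):
        if i - n >= 0:
            feats.add("left=" + "_".join(tokens[i - n:i]))
        if i + n < len(tokens):
            feats.add("right=" + "_".join(tokens[i + 1:i + 1 + n]))
    return feats


if __name__ == "__main__":
    # Placeholder tokens stand in for Thai words in one candidate segmentation.
    candidate = ["w1", "w2", "w3", "w4", "w5"]
    target = 2  # index of the word in question
    print(sorted(context_word_features(candidate, target)))
    print(sorted(collocation_features(candidate, target)))
```

Each feature is a binary test on the context, so the set of active features for a target word could be passed to a rule learner or a linear classifier as a sparse example.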