Contextual behaviour features and grammar rules for Thai sentence-breaking

Statistical approaches that use the context surrounding a space as the main feature have been widely used for Thai sentence-breaking. However, this feature does not capture contextual behaviour across the entire sentence, nor does it take advantage of Thai grammar rules to determine a sentence boundary. This paper proposes a hybrid approach to Thai sentence-breaking that integrates a rule-based method with a statistical approach using contextual behaviour features that reflect natural language behaviour. The performance of Thai sentence-breaking using the number of words in a chunk, the existence of a verb, and rules is compared. Experimental results show that using the number of words in a chunk achieves higher accuracy than the other features. Moreover, integrating those features with the rule-based method improves accuracy further. The space-correct and false-break scores are 93.54% and 2.99%, respectively.
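As an illustration only, the two contextual features named above and the two reported scores might be computed along the following lines. This is a minimal sketch and not the paper's implementation. It assumes that a chunk is the run of words between two spaces, that each word already carries a part-of-speech tag, that space-correct is the percentage of spaces classified correctly, and that false-break is the percentage of spaces wrongly marked as sentence breaks. The function names, the `VERB` tag, and these metric definitions are all hypothetical.

```python
from typing import Dict, List, Tuple

# A token is a (word, POS tag) pair; the tag set here is hypothetical.
Token = Tuple[str, str]


def chunk_features(
    left_chunk: List[Token],
    right_chunk: List[Token],
    verb_tags: frozenset = frozenset({"VERB"}),
) -> Dict[str, object]:
    """Features for the space between two chunks (assumed feature layout)."""
    return {
        # Number of words in the chunk on each side of the space.
        "left_word_count": len(left_chunk),
        "right_word_count": len(right_chunk),
        # Whether each chunk contains at least one verb.
        "left_has_verb": any(tag in verb_tags for _, tag in left_chunk),
        "right_has_verb": any(tag in verb_tags for _, tag in right_chunk),
    }


def space_scores(gold: List[bool], predicted: List[bool]) -> Dict[str, float]:
    """Score per-space decisions, where True means 'sentence break'.

    Assumed definitions (not confirmed by the abstract):
      space_correct = correctly classified spaces / all spaces
      false_break   = non-breaks predicted as breaks / all spaces
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    total = len(gold)
    correct = sum(g == p for g, p in zip(gold, predicted))
    false_breaks = sum(p and not g for g, p in zip(gold, predicted))
    return {
        "space_correct": 100.0 * correct / total,
        "false_break": 100.0 * false_breaks / total,
    }
```

In a hybrid setup of the kind described, features like these would feed the statistical classifier, while grammar rules could accept or veto a candidate break before or after classification; the abstract does not say which order is used.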