This paper discusses the problems of word and sentence segmentation in Thai. Disagreements over word segmentation arise mostly from compound words. To establish a standard resource and tool for word segmentation, we suggest that only simple words and true compound words be segmented during word segmentation. Other compounds can be grouped later by the same means used for multiword identification in other languages. Sentence segmentation is also difficult because sentence boundaries in Thai are fuzzy. We suggest that a discourse be viewed as a combination of clauses rather than sentences. Discourse clues can then be used to segment these discourse units. The output of the sentence segmentation module could be a sequence of segments composed of clauses, which can then be assembled into a discourse structure.
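The two-stage treatment of words proposed above can be illustrated with a minimal sketch. It assumes a toy lexicon and a toy multiword list, and it uses dictionary-based longest matching for the first pass, which is a stand-in technique chosen for brevity and not necessarily the method the paper has in mind. The function names `segment` and `group_multiwords`, the lexicon entries, and the classification of each entry as a simple word, true compound, or looser compound are all hypothetical.

```python
# Hypothetical toy lexicon: simple words plus one entry treated as a true compound.
LEXICON = {"แม่", "น้ำ", "แม่น้ำ", "คน", "ขับ", "รถ"}

# Hypothetical list of looser compounds, stored as token sequences for later grouping.
MULTIWORDS = {("คน", "ขับ", "รถ")}


def segment(text, lexicon):
    """First pass: greedy longest matching against the lexicon."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone so the pass always advances.
            tokens.append(text[i])
            i += 1
    return tokens


def group_multiwords(tokens, multiwords):
    """Second pass: merge token sequences that appear in the multiword list."""
    max_n = max((len(m) for m in multiwords), default=1)
    grouped = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in multiwords:
                grouped.append("".join(tokens[i:i + n]))
                i += n
                break
        else:
            grouped.append(tokens[i])
            i += 1
    return grouped


words = segment("คนขับรถ", LEXICON)
units = group_multiwords(words, MULTIWORDS)
```

Keeping the two passes separate means the first-pass output stays stable across applications, while the grouping list can be varied per task without re-segmenting the text.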