Paper information - Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm

A "word" is difficult to define in languages that do not mark word boundaries explicitly, such as Thai. Traditional methods of defining words for such languages depend on human judgement based on unclear criteria or procedures, and they have several limitations. This paper proposes an algorithm for extracting words from Thai text without relying on word segmentation. We employ the C4.5 learning algorithm for this task. Several attributes, such as string length, frequency, mutual information, and entropy, are used to decide whether a string is a word or a non-word. Our experiments yield a precision of about 85% on both the training and test corpora.
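To make the attributes named in the abstract concrete, the following is a minimal sketch of how such features might be computed for a candidate string taken from raw, unsegmented text. The abstract does not give the exact definitions, so everything here is an assumption for illustration: the function names, the use of pointwise mutual information between a string's prefix and its last character, the use of Shannon entropy over the characters adjacent to the string, and the toy ASCII text standing in for a Thai corpus. It is not the authors' implementation.

```python
import math
from collections import Counter


def substring_counts(text, max_len):
    """Count every (overlapping) substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def boundary_entropy(text, candidate, side):
    """Shannon entropy (in bits) of the character next to each occurrence.

    side is "left" or "right". High entropy means many different characters
    can precede or follow the candidate (assumed here to hint at a boundary).
    """
    neighbours = Counter()
    start = text.find(candidate)
    while start != -1:
        pos = start - 1 if side == "left" else start + len(candidate)
        if 0 <= pos < len(text):
            neighbours[text[pos]] += 1
        start = text.find(candidate, start + 1)  # allow overlapping matches
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in neighbours.values())


def candidate_features(text, candidate, counts):
    """Build one feature vector for a candidate string.

    Assumption: mutual information is the pointwise MI between the prefix
    (all but the last character) and the last character, with probabilities
    estimated as count / len(text).
    """
    n = len(text)
    if len(candidate) < 2 or counts[candidate] == 0:
        mi = 0.0
    else:
        p_s = counts[candidate] / n
        p_prefix = counts[candidate[:-1]] / n
        p_last = counts[candidate[-1]] / n
        mi = math.log2(p_s / (p_prefix * p_last))
    return {
        "length": len(candidate),
        "frequency": counts[candidate],
        "mutual_information": mi,
        "left_entropy": boundary_entropy(text, candidate, "left"),
        "right_entropy": boundary_entropy(text, candidate, "right"),
    }


# Toy usage: a short ASCII string stands in for an unsegmented corpus.
toy_text = "abcabcabd"
toy_counts = substring_counts(toy_text, max_len=3)
features_ab = candidate_features(toy_text, "ab", toy_counts)
features_abc = candidate_features(toy_text, "abc", toy_counts)
```

Feature vectors of this kind, each paired with a word/non-word label, are the sort of input a decision-tree learner such as C4.5 would be trained on. The learner itself is not shown here.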
