Boosting-Based Ensemble Learning with Penalty Setting Profiles for Automatic Thai Unknown Word Recognition

Boosting-based ensemble learning can improve classification accuracy by constructing multiple classification models, each built to cope with the errors of the preceding steps. This paper presents an application of boosting-based ensemble learning with penalty setting profiles to automatic unknown word recognition in Thai. Treating a sequential task as a non-sequential problem requires us to rank a set of generated candidates for each potential unknown word position. Since the correct candidate might not be located at the highest rank among the candidates in the set, the proposed method assigns penalties, in the form of a penalty setting profile, to improper rankings in order to reconstruct the succeeding classification model. In addition, a number of alternative penalty setting profiles are introduced, and their performances are compared on the task of extracting unknown words from a large Thai medical text. Using naive Bayes as the base classifier for ensemble learning, the proposed method achieves an accuracy of 89.24%, which is an improvement of 9.91%, 7.54%, and 5.25% over conventional naive Bayes, the non-ensemble version, and the flat penalty setting profile, respectively.
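As a rough illustration of the idea of penalizing improper rankings, the sketch below reweights candidate sets between boosting rounds according to the rank at which the correct candidate appeared. It is a minimal sketch under stated assumptions, not the paper's actual procedure: the function names (`flat_profile`, `linear_profile`, `reweight`), the penalty values, and the multiplicative update rule are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
from typing import Callable, List


def flat_profile(rank: int) -> float:
    """Hypothetical flat profile: every misranked set gets the same penalty."""
    return 1.0 if rank == 1 else 2.0


def linear_profile(rank: int) -> float:
    """Hypothetical alternative: the penalty grows with the rank of the correct candidate."""
    return float(rank)


def reweight(
    weights: List[float],
    correct_ranks: List[int],
    profile: Callable[[int], float],
) -> List[float]:
    """Scale each candidate set's weight by its penalty, then renormalize.

    correct_ranks[i] is the rank (1 = top) that the current model gave to
    the correct candidate in set i. The multiplicative update is an
    assumption made for this sketch.
    """
    if len(weights) != len(correct_ranks):
        raise ValueError("weights and correct_ranks must have the same length")
    scaled = [w * profile(r) for w, r in zip(weights, correct_ranks)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Four candidate sets with uniform initial weights; the correct candidate
# was ranked 1st, 1st, 2nd, and 4th respectively (made-up values).
weights = [0.25, 0.25, 0.25, 0.25]
correct_ranks = [1, 1, 2, 4]

flat = reweight(weights, correct_ranks, flat_profile)
linear = reweight(weights, correct_ranks, linear_profile)
```

Under the flat profile, the two misranked sets receive the same increased weight, whereas under the linear profile the set whose correct candidate fell to rank 4 is emphasized more than the one at rank 2. The next base classifier would then be trained on the reweighted sets.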
