Stop Word in Readability Assessment of Thai Text

Teachers and parents may use readability to select appropriate learning materials for primary school students. This research constructs a Thai stop word list and evaluates the impact of eliminating stop words on the readability assessment of Thai text. The corpus contains 1,188 textbook articles used by students from grade 1 to grade 6. Word segmentation, stop word list extraction, and feature selection are the preprocessing tasks performed on the articles in the corpus. The term frequency-inverse document frequency (TF-IDF) weights of the selected terms are then used as features for support vector machines (SVMs) to generate classification models. Experimental results show that the F-measure can reach 0.87 when identifying Thai articles suitable for middle-grade primary school students.
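
The stop word removal and TF-IDF weighting steps described above can be illustrated with a minimal sketch. It assumes the articles have already been segmented into word tokens, and it uses one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency); the abstract does not state which variant the authors used. The function name `tfidf_features`, the placeholder tokens, and the stop word set are hypothetical, and the SVM training step is omitted.

```python
import math
from collections import Counter


def tfidf_features(docs, stop_words):
    """Return one {term: TF-IDF weight} dict per document.

    docs: list of token lists (word segmentation is assumed done already).
    stop_words: set of tokens to discard before weighting.
    """
    # Drop stop words from every document.
    filtered = [[t for t in doc if t not in stop_words] for doc in docs]
    n_docs = len(filtered)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in filtered:
        df.update(set(doc))

    features = []
    for doc in filtered:
        counts = Counter(doc)
        total = len(doc)
        # Assumed variant: tf = count / doc length, idf = ln(N / df).
        features.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return features


# Placeholder tokens stand in for segmented Thai words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran", "far"],
    ["the", "cat", "ran"],
]
features = tfidf_features(docs, stop_words={"the"})
```

In this toy example the token "the" occurs in every document and is removed as a stop word, so it contributes no feature. The resulting sparse vectors could then be passed to an SVM library to train one classifier per reading level.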
