SVMAUD: Using textual information to predict the audience level of written works using support vector machines

by Todd Will

Information retrieval systems should seek to match resources with the reading ability of the individual user; similarly, an author must choose vocabulary and sentence structures appropriate for his or her audience. Traditional readability formulas, including the popular Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score, rely on numerical representations of text characteristics, such as syllable counts and sentence lengths, to suggest the audience level of a resource. However, the author’s chosen vocabulary, sentence structure, and even the page formatting can shift the predicted audience level by several levels, especially in the case of digital library resources. For these reasons, readability formulas perform very poorly when predicting the audience level of digital library resources.

Rather than relying on these inputs, machine learning methods, including cosine, Naïve Bayes, and Support Vector Machines (SVM), can suggest the grade level of an essay based on the vocabulary chosen by the author. The audience level prediction and essay grading problems share the same inputs (expert-labeled documents) and the same outputs (a numerical score representing quality or audience level). After a human expert labels a representative sample of resources with an audience level, the proposed SVM-based audience level prediction program, SVMAUD, constructs a vocabulary for each audience level; the text in an unlabeled resource is then compared with this predefined vocabulary to suggest the most appropriate audience level. Two readability formulas and four machine learning programs are evaluated on how well they predict the audience levels entered by human experts, based on the text contained in an unlabeled resource.
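The vocabulary-comparison idea described above can be illustrated with the simplest of the evaluated approaches, cosine similarity. The following is only a minimal sketch under stated assumptions, not the thesis implementation: the function names (`tokenize`, `build_level_profiles`, `predict_level`), the raw term-frequency weighting, and the four toy training sentences are all hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_level_profiles(labeled_docs):
    """Sum term frequencies of expert-labeled documents into one vocabulary per level."""
    profiles = {}
    for text, level in labeled_docs:
        profiles.setdefault(level, Counter()).update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_level(text, profiles):
    """Suggest the level whose vocabulary is most similar to the unlabeled text."""
    vec = Counter(tokenize(text))
    return max(profiles, key=lambda level: cosine(vec, profiles[level]))


# Toy, made-up training data standing in for expert-labeled resources.
training = [
    ("the cat sat on the mat and the dog ran", "elementary"),
    ("we see the sun and the moon in the sky", "elementary"),
    ("photosynthesis converts light energy into chemical energy", "high school"),
    ("the mitochondria produce energy through cellular respiration", "high school"),
]
profiles = build_level_profiles(training)

print(predict_level("the dog and the cat see the moon", profiles))
print(predict_level("cellular respiration releases chemical energy", profiles))
```

An SVM-based predictor such as SVMAUD would replace the nearest-profile rule with a trained classifier over the same kind of term vectors; the sketch shows only the shared input representation and the compare-against-labeled-vocabulary step.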
In a collection containing 10,238 expert-labeled HTML-based digital library resources, the Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score predict the specific audience level with F-measures of 0.10 and 0.05, respectively. In contrast, cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD improve these F-measures to 0.57, 0.61, 0.68, and 0.78, respectively. When a term’s weight is adjusted based on the HTML tag in which it occurs, the specific audience level prediction performance of cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD improves to 0.68, 0.70, 0.75, and 0.84, respectively. When title, keyword, and abstract metadata are used for training, the specific audience level prediction F-measures of cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD are 0.61, 0.68, 0.75, and 0.86, respectively. When cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD are trained and tested using resources from a single subject category, the specific audience level prediction F-measures improve to 0.63, 0.70, 0.77, and 0.87, respectively. SVMAUD achieves the highest audience level prediction performance among all methods evaluated in this study. After SVMAUD is properly trained, it can be used to predict the audience level of any written work.
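The HTML tag weighting mentioned above can be sketched as follows. This is an illustrative assumption rather than the scheme used in the study: the abstract does not state which tags are weighted or by how much, so the `TAG_WEIGHTS` values, the rule of taking the largest weight among the enclosing tags, and the names `WeightedTermCounter` and `weighted_terms` are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; the abstract does not give the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}


class WeightedTermCounter(HTMLParser):
    """Accumulate term weights, boosting terms that occur inside selected tags."""

    def __init__(self, tag_weights):
        super().__init__()
        self.tag_weights = tag_weights
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumed rule: a term takes the largest weight among its enclosing tags.
        weight = max(
            (self.tag_weights.get(t, 1.0) for t in self.open_tags), default=1.0
        )
        for term in re.findall(r"[a-z']+", data.lower()):
            self.weights[term] += weight


def weighted_terms(html, tag_weights=TAG_WEIGHTS):
    """Return a Counter mapping each term to its tag-adjusted weight."""
    parser = WeightedTermCounter(tag_weights)
    parser.feed(html)
    parser.close()
    return parser.weights


page = (
    "<html><head><title>Water cycle</title></head>"
    "<body><h1>Evaporation</h1><p>Water moves in a <b>cycle</b>.</p></body></html>"
)
weights = weighted_terms(page)
print(weights["water"], weights["cycle"], weights["evaporation"])
```

The resulting weighted term vector could then be passed to any of the vocabulary-based classifiers in place of raw term counts.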
