Automatic classification of academic documents using text mining techniques

This work presents an automatic classifier of undergraduate final projects based on text mining. The dataset, comprising documents from four professional categories, was represented by means of the vector space model with different term-weighting metrics. Several dimensionality reduction techniques were also applied to the word space. The classification model was built with the K-nearest neighbor algorithm. Under 10-fold cross-validation, the classifier reached a predictive accuracy of 82%. When it was allowed to recommend up to two categories, to account for the interdisciplinary nature of the documents, accuracy rose to 95%. The classifier was integrated into an application for the automatic assignment of reviewers, which selects them from among the teachers who belong to the recommended areas.
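
The pipeline described above (vector space representation, K-nearest neighbor classification, and a recommendation of up to two categories) can be illustrated with a minimal sketch. It is not the authors' implementation: the TF-IDF weighting, the cosine similarity, the similarity-weighted vote, the toy corpus, the category names, and every function name are assumptions chosen for illustration, and the dimensionality reduction and cross-validation steps are omitted.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens (no stemming in this sketch).
    return [w for w in text.lower().split() if w.isalpha()]

def fit_idf(docs_tokens):
    # Smoothed inverse document frequency computed on the training documents.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf(tokens, idf):
    # L2-normalised TF-IDF vector stored as a sparse dict; unseen terms are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def recommend(query, train_vecs, labels, idf, k=3, top=2):
    # Take the k most similar training documents, let each vote for its
    # category with its similarity as weight, and return up to `top` categories.
    q = tfidf(tokenize(query), idf)
    scored = sorted(
        ((cosine(q, v), lab) for v, lab in zip(train_vecs, labels)),
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, lab in scored:
        if sim > 0:  # ignore neighbours that share no terms with the query
            votes[lab] += sim
    return [lab for lab, _ in votes.most_common(top)]

# Hypothetical toy corpus: four invented categories, two documents each.
TRAIN = [
    ("relational database query optimization index", "databases"),
    ("sql transaction storage database schema", "databases"),
    ("routing protocol network packet latency", "networks"),
    ("wireless network bandwidth protocol congestion", "networks"),
    ("neural network learning classifier training", "ai"),
    ("classifier learning features training accuracy", "ai"),
    ("rendering shader texture graphics mesh", "graphics"),
    ("graphics animation mesh rendering lighting", "graphics"),
]

TOKENS = [tokenize(text) for text, _ in TRAIN]
LABELS = [label for _, label in TRAIN]
IDF = fit_idf(TOKENS)
VECS = [tfidf(tokens, IDF) for tokens in TOKENS]

single = recommend("database query index storage schema", VECS, LABELS, IDF)
mixed = recommend("network protocol classifier training learning", VECS, LABELS, IDF)
```

In this toy setting, `single` holds the recommendation for a query that draws on one area only, while `mixed` holds the recommendation for a query whose vocabulary spans two areas. Returning a short ranked list rather than one label is what lets an interdisciplinary document count as correctly classified when either of its areas is recommended.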
