Enhanced intelligent text categorization using concise keyword analysis

Supervised learning is a popular approach to text classification in the research community as well as in the software development industry. It enables intelligent systems to solve various text analysis problems such as document organization, spam detection, and report scoring. However, creating a training corpus is difficult and time-intensive, which makes the approach impractical for many text classification problems. In this research, we explored ways of addressing this pitfall by studying the ontological characteristics of document categories and grouping them under virtual super-categories to narrow down the search for a suitable category. Applying this method greatly improved classifier performance despite the relatively small size of the training corpus.
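The idea of narrowing the search through virtual super-categories can be illustrated with a minimal two-stage sketch. This is not the classifier used in this research: it swaps in plain keyword counting for a trained model, and the taxonomy, keyword sets, and function names (`TAXONOMY`, `tokenize`, `score`, `classify`) are all hypothetical, chosen only to show the control flow of picking a super-category first and a category second.

```python
from collections import Counter

# Hypothetical taxonomy: super-category -> category -> keyword set.
# These names and keywords are illustrative assumptions, not from the paper.
TAXONOMY = {
    "communication": {
        "spam": {"free", "winner", "prize", "claim"},
        "personal": {"meeting", "lunch", "family", "tomorrow"},
    },
    "reporting": {
        "environment": {"emissions", "energy", "waste", "water"},
        "governance": {"board", "audit", "compliance", "ethics"},
    },
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [word.strip(".,;:!?").lower() for word in text.split()]


def score(tokens, keywords):
    """Count how many tokens match any of the given keywords."""
    counts = Counter(tokens)
    return sum(counts[keyword] for keyword in keywords)


def classify(text):
    """Return (super_category, category) using a two-stage keyword search."""
    tokens = tokenize(text)

    # Stage 1: score each super-category by the union of its categories' keywords.
    super_scores = {
        name: score(tokens, set().union(*categories.values()))
        for name, categories in TAXONOMY.items()
    }
    best_super = max(super_scores, key=super_scores.get)

    # Stage 2: only the categories under the chosen super-category are compared.
    category_scores = {
        name: score(tokens, keywords)
        for name, keywords in TAXONOMY[best_super].items()
    }
    best_category = max(category_scores, key=category_scores.get)
    return best_super, best_category


if __name__ == "__main__":
    print(classify("Claim your free prize now, winner!"))
    print(classify("The board reviewed the audit and compliance findings."))
```

In this sketch, the second stage compares only the categories under the selected super-category, which is the sense in which the search is narrowed. Ties fall to whichever entry is listed first in the dictionary; a real system would need a more deliberate tie-breaking rule and, as in the research described above, a trained classifier at each level.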
