Text Categorization Using Hyper Rectangular Keyword Extraction: Application to News Articles Classification

Automatic text categorization remains an active research topic. Typical applications include helping end users archive existing documents or browse an existing corpus of documents hierarchically. Text categorization usually consists of two main steps: keyword extraction and classification. In this paper, a corpus of documents is represented by a binary relation linking each document to the words it contains. From this relation, the Hyper Rectangle Algorithm extracts the most representative words in a hierarchical way. The hyper-rectangle associated with an element of the range of a binary relation is the union of all non-enlargeable rectangles containing it. The extracted keywords are fed into a random forest classifier to predict the category of each document. The method has been validated on the widely used Reuters-21578 news articles collection. Results are promising and indicate that the Hyper Rectangular method is effective at extracting relevant keywords.
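The definition of a hyper-rectangle can be made concrete on a toy corpus. The following is a minimal sketch, not the paper's implementation: the corpus, the function names, and the size-based ranking are all hypothetical. It reads a "non-enlargeable rectangle" as a maximal set of documents paired with a maximal set of words they all share, and it checks by brute force that, under this reading, the union of such rectangles containing a word coincides with the relation restricted to the documents that contain that word.

```python
from itertools import combinations

# Hypothetical toy corpus: document id -> set of words it contains.
# This dictionary plays the role of the binary relation (document, word).
relation = {
    "d1": {"oil", "price", "market"},
    "d2": {"oil", "export"},
    "d3": {"wheat", "export", "market"},
}


def maximal_rectangles(rel):
    """Brute-force every non-enlargeable rectangle (docs, words) of rel.

    Only practical for tiny examples: it enumerates all document subsets.
    """
    docs = list(rel)
    found = set()
    for k in range(1, len(docs) + 1):
        for group in combinations(docs, k):
            common = set.intersection(*(rel[d] for d in group))
            if not common:
                continue
            # Close the document side: every document containing all common words.
            extent = frozenset(d for d in docs if common <= rel[d])
            found.add((extent, frozenset(common)))
    return found


def union_of_maximal_rectangles(rel, word):
    """Union of all maximal rectangles whose word side contains `word`."""
    covered = set()
    for extent, intent in maximal_rectangles(rel):
        if word in intent:
            covered |= {(d, w) for d in extent for w in intent}
    return covered


def hyper_rectangle(rel, word):
    """Direct form: all pairs of rel whose document contains `word`."""
    return {(d, w) for d, words in rel.items() if word in words for w in words}


def rank_words(rel):
    """Order words by hyper-rectangle size (assumed stand-in for the paper's weighting)."""
    vocabulary = set().union(*rel.values())
    return sorted(vocabulary, key=lambda w: (-len(hyper_rectangle(rel, w)), w))


if __name__ == "__main__":
    for word in rank_words(relation):
        same = hyper_rectangle(relation, word) == union_of_maximal_rectangles(relation, word)
        print(word, len(hyper_rectangle(relation, word)), same)
```

The ranking here simply counts the pairs each hyper-rectangle covers; the paper's actual weighting and its hierarchical extraction are not specified in the abstract, so this part should be read only as a placeholder. The selected words would then serve as features for a classifier such as a random forest.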
