A novel approach for ontology-based dimensionality reduction for web text document classification

Reducing the dimensionality of the feature vector plays a vital role in enhancing text processing capabilities: it shrinks the feature vector used in mining tasks such as classification and clustering. This paper proposes an efficient approach for reducing the size of the feature vector in web text document classification. The approach uses the WordNet ontology and exploits its hierarchical structure to eliminate from the generated feature vector those words that have no relation to any of the WordNet lexical categories; this reduces the feature vector size without losing information about the text. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents, and Term Frequency-Inverse Document Frequency (TF-IDF) is used as the term weighting method. The proposed ontology-based approach was evaluated against Principal Component Analysis (PCA) in several experiments. The experimental results show that the proposed approach is more effective than the traditional approaches it was compared with, achieving better classification accuracy, F-measure, precision, and recall.
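The pipeline described above (drop terms with no ontology entry, then weight the remaining terms in a vector space model) can be illustrated with a minimal sketch. This is not the authors' implementation: the `TOY_LEXICON` set is a hypothetical stand-in for a real WordNet lookup (in practice one might test whether NLTK's `wordnet.synsets(word)` returns a non-empty list), and the sample documents, function names, and the plain `tf * log(N / df)` weighting are assumptions made for illustration.

```python
import math
from collections import Counter

# ASSUMPTION: a tiny hand-made word set standing in for WordNet.
# A real system would query the ontology instead of this set.
TOY_LEXICON = {
    "football", "team", "wins", "match", "goal",
    "stock", "market", "price", "falls", "bank",
}

def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]

def build_vocabularies(docs, lexicon):
    """Return (full vocabulary, vocabulary restricted to lexicon terms)."""
    full = sorted({tok for doc in docs for tok in tokenize(doc)})
    kept = [tok for tok in full if tok in lexicon]
    return full, kept

def tfidf_vectors(docs, vocab):
    """Represent each document as a TF-IDF vector over vocab (VSM)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            counts[t] * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocab
        ])
    return vectors

# Hypothetical sample documents; "xyzzy", "qwerty" and "asdf" play the
# role of tokens that have no entry in the ontology.
docs = [
    "football team wins match xyzzy",
    "stock market price falls qwerty",
    "team goal asdf",
]

full_vocab, kept_vocab = build_vocabularies(docs, TOY_LEXICON)
vectors = tfidf_vectors(docs, kept_vocab)

print("full vocabulary size:   ", len(full_vocab))
print("reduced vocabulary size:", len(kept_vocab))
print("first document vector:  ", vectors[0])
```

The reduction here happens before weighting, so the TF-IDF vectors are built only over terms the lexicon recognises; the resulting vectors could then be passed to any classifier.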
