A survey on text document categorization using enhanced sentence vector space model and bi-gram text representation model based on novel fusion techniques

Today, a large number of digital documents are generated and made available each day. However, classifying them manually into reasonable categories, such as important and unimportant or spam and no-spam, would cost a vast amount of time and human effort. Text document classification falls under the automatic classification (also known as pattern recognition) problem in machine learning and text mining. Classifying large collections of text documents into specific classes makes them clearer and simpler to search, and classified data are easy for users to browse. The core of text document classification is assigning an unknown text to one of a set of predefined categories. Multiple classifiers can be fused to increase the classification accuracy for a single text document. This paper surveys text document classification based on different representation models and fusion-based classifiers. To examine the different techniques, the Enhanced Sentence Vector Space Model (ES-VSM) and a bigram model are used to represent the document in question. The survey is completed by assessing current classifiers according to the accuracy of their performance. It is intended to inform new researchers and encourage them to take on the challenging problems in this area.
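
As a rough illustration of the two ideas named in the abstract, the following minimal sketch builds a word-bigram representation of a document and fuses the outputs of two classifiers at the score level. It is not the method of the surveyed papers: the function names (`bigrams`, `fuse_scores`, `predict`), the sample sentence, the two score dictionaries, and the choice of plain averaging as the fusion rule are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(text):
    """Count word bigrams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def fuse_scores(score_dicts):
    """Score-level fusion (assumed rule: unweighted mean of per-class scores)."""
    classes = score_dicts[0].keys()
    return {c: sum(s[c] for s in score_dicts) / len(score_dicts) for c in classes}


def predict(fused):
    """Return the class with the highest fused score."""
    return max(fused, key=fused.get)


# Hypothetical document and bigram representation.
doc = "free prize inside claim your free prize"
features = bigrams(doc)

# Hypothetical per-class scores from two separately trained classifiers.
scores_a = {"spam": 0.70, "no-spam": 0.30}
scores_b = {"spam": 0.40, "no-spam": 0.60}

fused = fuse_scores([scores_a, scores_b])
label = predict(fused)
```

Averaging is only one possible fusion rule; weighted sums, maximum, or product rules could be substituted in `fuse_scores` without changing the rest of the sketch.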
