Combining Multiple Views from a Distance Based Feature Extraction for Text Classification

Text mining is challenging because documents lack a naturally structured representation and because commonly used feature extraction techniques induce high dimensionality. Different feature extraction methods yield multiple views, each capturing a different aspect of the documents being analyzed. Combining these views can improve classification accuracy, but it also causes an undesirable increase in the number of features. In this work, we investigate DCDistance, a feature extraction technique, applied as a multi-view feature extractor for text documents and combined with a Genetic Algorithm based feature selection; we call the resulting method MVDCD. The results show that MVDCD reduces dimensionality by more than 90% while significantly increasing classification accuracy compared with vanilla DCDistance and other feature selection techniques. A side benefit of DCDistance and MVDCD is model interpretability, since the extracted features are explicit.
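
The abstract does not specify how the views are built or how the genetic algorithm is configured, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each class is represented by the mean of its document vectors, that each view is the distance from a document to every class representative under one metric (Euclidean, Manhattan, cosine), and that the genetic algorithm scores a feature subset by nearest-centroid training accuracy minus a small size penalty. All function names, parameters, and the toy data are hypothetical.

```python
import numpy as np

def class_centroids(X, y):
    """Mean vector per class (an assumed choice of class representative)."""
    classes = np.unique(y)
    return classes, np.vstack([X[y == c].mean(axis=0) for c in classes])

def distance_views(X, centroids):
    """Concatenate one view per metric; each view has one column per class."""
    diff = X[:, None, :] - centroids[None, :, :]   # (n_docs, n_classes, n_terms)
    euclid = np.sqrt((diff ** 2).sum(axis=2))
    manhattan = np.abs(diff).sum(axis=2)
    norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(centroids, axis=1)[None, :]
    cosine = 1.0 - (X @ centroids.T) / np.maximum(norms, 1e-12)
    return np.hstack([euclid, manhattan, cosine])

def fitness(mask, F, y):
    """Nearest-centroid training accuracy on the selected columns,
    minus a small penalty on the fraction of features kept (assumed fitness)."""
    if not mask.any():
        return 0.0
    Fs = F[:, mask]
    classes, cents = class_centroids(Fs, y)
    d = np.linalg.norm(Fs[:, None, :] - cents[None, :, :], axis=2)
    acc = (classes[d.argmin(axis=1)] == y).mean()
    return acc - 0.01 * mask.mean()

def select_features(F, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Tiny genetic algorithm over boolean feature masks."""
    rng = np.random.default_rng(seed)
    n = F.shape[1]
    pop = rng.random((pop_size, n)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m, F, y) for m in pop])
        elite = pop[scores.argmax()].copy()
        # Binary tournament selection.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[a] >= scores[b])[:, None], pop[a], pop[b])
        # Uniform crossover with a randomly permuted partner.
        partners = parents[rng.permutation(pop_size)]
        children = np.where(rng.random((pop_size, n)) < 0.5, parents, partners)
        # Bit-flip mutation, then keep the best individual unchanged (elitism).
        children ^= rng.random((pop_size, n)) < p_mut
        children[0] = elite
        pop = children
    scores = np.array([fitness(m, F, y) for m in pop])
    return pop[scores.argmax()]

# Toy term-count matrix: 6 documents, 4 terms, 2 classes (made-up data).
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [4, 0, 0, 1],
              [0, 3, 0, 2], [0, 2, 1, 3], [1, 4, 0, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

_, cents = class_centroids(X, y)
F = distance_views(X, cents)      # 3 metrics x 2 classes = 6 features
mask = select_features(F, y)
print("kept features:", np.flatnonzero(mask))
```

Because every column of `F` is a distance to a named class under a named metric, the columns kept by the mask remain directly readable, which is the kind of interpretability the abstract refers to. A real evaluation would score subsets on held-out data rather than on training accuracy.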
