Hybrid supervised clustering based ensemble scheme for text classification

Purpose: The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification is an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.

Design/methodology/approach: An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, hybrid supervised clustering, which is based on the cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversity can be provided. Each classifier is trained on one of the diversified training subsets, and the predictions of the individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naive Bayes, logistic regression, support vector machines and the C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.

Findings: The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.

Originality/value: The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.
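The pipeline described in the abstract (cluster the samples of each class, build training subsets from the clusters, train one classifier per subset, combine by majority vote) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: plain k-means stands in for the cuckoo search and k-means hybrid, a nearest-centroid rule stands in for the base classifiers, and the way clusters from different classes are paired into subsets is a hypothetical choice. All function names (`kmeans`, `build_ensemble`, `predict`) are illustrative.

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means. ASSUMPTION: stands in for the cuckoo search + k-means hybrid."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def train_nearest_centroid(subset):
    """Hypothetical base classifier: one centroid per class label."""
    by_label = {}
    for x, label in subset:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict_one(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: sq_dist(x, model[label]))


def build_ensemble(X, y, k=2, seed=0):
    """Cluster each class separately, then train one model per training subset."""
    per_class = {}
    for label in sorted(set(y)):
        pts = [x for x, lab in zip(X, y) if lab == label]
        per_class[label] = kmeans(pts, min(k, len(pts)), seed=seed)
    n_subsets = max(len(clusters) for clusters in per_class.values())
    models = []
    for i in range(n_subsets):
        # ASSUMPTION: subset i takes the i-th cluster of every class (wrapping around).
        subset = [
            (x, label)
            for label, clusters in per_class.items()
            for x in clusters[i % len(clusters)]
        ]
        models.append(train_nearest_centroid(subset))
    return models


def predict(models, x):
    """Combine the individual predictions by the majority voting rule."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

In this sketch each training subset contains samples from every class, so every base classifier can predict any label. The diversity comes only from which cluster of each class a classifier sees. For real text data, the vectors would be document representations such as term-frequency features.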
