Cascading k-means with Ensemble Learning: Enhanced Categorization of Diabetic Data

Abstract. This paper illustrates the application of various ensemble methods to enhance classification accuracy, using the Pima Indian Diabetic Dataset (PIDD) as a case study. The computational model comprises two stages. In the first stage, k-means clustering is employed to identify and eliminate wrongly classified instances. In the second stage, the classification is fine-tuned: ensemble methods such as AdaBoost, bagging, dagging, stacking, decorate, rotation forest, random subspace, MultiBoost and grading are invoked along with five chosen base classifiers, namely support vector machine (SVM), radial basis function network (RBF), decision tree J48, naïve Bayes and Bayesian network. The k-fold cross-validation technique is adopted throughout. Computational experiments with the proposed method showed an improvement of 16.14% to 22.49% in classification accuracy over results reported in the literature. Among the ensemble methods tried, MultiBoost with SVM and grading with naïve Bayes showed the best performance, followed by MultiBoost, stacking and grading with the Bayesian network classifier, rotation forest with RBF, and grading and rotation forest with J48. This investigation conclusively demonstrates the significance of cascading k-means clustering with ensemble methods in enhancing the classification accuracy on the diabetic dataset.
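To make the two-stage pipeline concrete, the following is a minimal sketch in Python using scikit-learn. It is an illustration under stated assumptions, not the authors' implementation: synthetic data stands in for PIDD; k-means with k = 2 filters out instances whose class label disagrees with their cluster's majority label (one plausible reading of "wrongly classified instances"); and AdaBoost over a decision tree plus bagging over naïve Bayes stand in for the larger set of ensembles named above, several of which (dagging, decorate, grading, MultiBoost, rotation forest) have no scikit-learn counterpart.

```python
# Hypothetical sketch of the cascaded pipeline; not the paper's code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for PIDD (768 instances, 8 attributes).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# Stage 1: k-means filtering. Cluster the data, find each cluster's
# majority class, and drop instances whose true label disagrees with it.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
keep = np.ones(len(y), dtype=bool)
for c in np.unique(labels):
    members = labels == c
    majority = np.bincount(y[members]).argmax()
    keep[members & (y != majority)] = False
X_clean, y_clean = X[keep], y[keep]

# Stage 2: ensemble classification on the filtered data, scored with
# k-fold cross-validation (k = 10 here).
ensembles = {
    "AdaBoost(tree)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3), n_estimators=50
    ),
    "Bagging(naive Bayes)": BaggingClassifier(GaussianNB(), n_estimators=10),
}
for name, clf in ensembles.items():
    scores = cross_val_score(clf, X_clean, y_clean, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Any of the five base classifiers from the abstract could be substituted into stage 2 in the same way; the filtering step in stage 1 is the part that the paper credits with the bulk of the accuracy gain.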
