Tree-based classifier ensembles for early detection method of diabetes: an exploratory study

Diabetes is a lifestyle-driven disease that has become a critical health issue worldwide. In this paper, we conduct an exploratory study of early detection of diabetes mellitus using various ensemble learning techniques. Eight tree-based machine learning algorithms, i.e., classification and regression tree, decision tree (C4.5), reduced error pruning tree, random tree, naive Bayes tree, functional tree, best-first decision tree, and logistic model tree, are employed as base classifiers in five different ensembles, i.e., bagging, boosting, random subspace, DECORATE, and rotation forest. The performance of the ensembles and base classifiers is thoroughly benchmarked on three real-world datasets in terms of the area under the receiver operating characteristic curve (AUC). Finally, we assess the performance differences among the classifiers using several statistical significance tests. We contribute to the existing literature by providing an extensive benchmark of tree-based classifier ensembles for the early detection of diabetes.
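The evaluation protocol summarized above (AUC per classifier per dataset, followed by a rank-based significance test) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `auc_from_scores`, `average_ranks`, and `friedman_statistic` are hypothetical, the Friedman test is assumed here as one representative significance test, and the statistic is computed without a tie correction.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a positive example outscores a negative one.

    Uses the pairwise (Mann-Whitney) formulation; tied scores count as 0.5.
    labels are 0/1, scores are classifier outputs for the positive class.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_ranks(auc_table):
    """Average rank of each classifier over all datasets.

    auc_table[i][j] is the AUC of classifier j on dataset i.
    Rank 1 is the highest AUC on a dataset; ties share the mean rank.
    """
    k = len(auc_table[0])
    totals = [0.0] * k
    for row in auc_table:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend j over the run of classifiers tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / len(auc_table) for t in totals]


def friedman_statistic(auc_table):
    """Friedman chi-square statistic (no tie correction) for N datasets, k classifiers."""
    n = len(auc_table)
    k = len(auc_table[0])
    ranks = average_ranks(auc_table)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
```

In practice the resulting statistic would be compared against a chi-square distribution with k - 1 degrees of freedom, and a post-hoc procedure would follow if the null hypothesis of equal performance is rejected. With only three datasets, as in this study, such tests have limited power, so the sketch is meant to show the mechanics rather than a recommended analysis.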
