Choosing an Appropriate Ensemble Classifier for Balanced Bioinformatics Data

Bioinformatics datasets exhibit a number of characteristics, such as noisy data and difficult-to-learn class boundaries, that make it challenging to build effective predictive models. One option for improving results is ensemble learning, which combines the results of multiple predictive models into a single decision. Because the decision does not rely on a single model, the effect of any hidden bias residing in one model is reduced. In this study, we investigate two ensemble learning methods, Select-Bagging and Random Forest, to determine which is better suited to the classification of bioinformatics data. In addition, we examine how the choice of learning algorithm affects the classification results of the Bagging method. We conduct an empirical study using six ensemble classifiers (Random Forest, plus Select-Bagging with each of five different classifiers) applied to 12 balanced datasets, using three feature rankers and four feature subset sizes. Based on our results, including statistical analysis, we recommend Random Forest: it is competitive with the best of the Select-Bagging classifiers and does not require an additional choice of classifier, a choice that can significantly affect classification performance. To our knowledge, this work is unique both in investigating the effectiveness of these two ensemble learning methods in the domain of bioinformatics and in examining how the choice of classifier impacts classification results when using a Bagging-based ensemble learning method.
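
To make the idea of combining multiple models into a single decision concrete, the following is a minimal sketch of plain Bagging with majority voting. It is not the Select-Bagging procedure or any of the learners evaluated in the study: the decision-stump base learner, the toy dataset `DATA`, and all function names are assumptions introduced purely for illustration.

```python
import random
from collections import Counter

def predict_stump(stump, x):
    """Predict 0 or 1 from a (feature, threshold, sign) decision stump."""
    feature, threshold, sign = stump
    above = x[feature] > threshold
    return int(above) if sign == 1 else int(not above)

def train_stump(samples):
    """Exhaustively pick the stump with the fewest errors on `samples`."""
    best_errors, best_stump = None, None
    for feature in range(len(samples[0][0])):
        for x, _ in samples:
            for sign in (1, -1):
                stump = (feature, x[feature], sign)
                errors = sum(predict_stump(stump, xi) != yi for xi, yi in samples)
                if best_errors is None or errors < best_errors:
                    best_errors, best_stump = errors, stump
    return best_stump

def bagging_fit(samples, n_models, rng):
    """Train one base model per bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Combine the base models' predictions by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical balanced toy data: feature 0 separates the classes,
# feature 1 is uninformative noise. Each item is ((f0, f1), label).
DATA = [((0.2 + 0.13 * i, (7 * i) % 5), 0) for i in range(10)] + \
       [((2.1 + 0.13 * i, (3 * i) % 5), 1) for i in range(10)]

# An odd number of models avoids tied votes in a two-class problem.
models = bagging_fit(DATA, n_models=25, rng=random.Random(0))
accuracy = sum(bagging_predict(models, x) == y for x, y in DATA) / len(DATA)
print(f"training accuracy of the bagged ensemble: {accuracy:.2f}")
```

Random Forest follows the same bootstrap-and-vote pattern but fixes the base learner to a decision tree that considers a random subset of features at each split, which is why it does not need the separate choice of classifier that a Bagging-based method requires.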
