Ensemble vs. Data Sampling: Which Option Is Best Suited to Improve Classification Performance of Imbalanced Bioinformatics Data?

Bioinformatics datasets exhibit challenging characteristics, such as class imbalance, which occurs when one class has many more instances than the other class(es). These challenges make classification considerably more difficult for practitioners and researchers in the field. Fortunately, tools such as ensemble learning and data sampling can be applied to overcome these problems and improve the performance of supervised classification models. Our motivation for this study is to investigate which option is better suited to tackling this significant challenge for bioinformatics data. Our literature survey shows that no previous work has conducted such an extensive study examining whether ensemble learning or data sampling is better suited for imbalanced gene expression data. To this end, we carried out an extensive experimental study using five ensemble classification methods, four other classification methods combined with random under-sampling, and three feature rankers with four feature subset sizes, across 15 highly imbalanced bioinformatics datasets. Our results, together with statistical analysis, confirm that ensemble learning methods generally outperform data sampling techniques in improving classification results. Furthermore, Select-Bagging with Naïve Bayes (NB), followed by Random Forest, are the two top-performing ensemble techniques. Based on these results, we recommend either Select-Bagging with NB or Random Forest with 100 trees (RF100) for imbalanced datasets; unlike Select-Bagging, RF100 does not depend on the choice of a base learner.
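To make the comparison concrete, the sketch below contrasts the two options on a synthetic imbalanced dataset: an ensemble learner (Random Forest with 100 trees) versus random under-sampling paired with a single base learner (NB), each preceded by a simple univariate feature ranker. This is a minimal illustration only, not the authors' pipeline; the scikit-learn/imbalanced-learn components, synthetic data, feature-subset size, and cross-validation settings are assumptions, and Select-Bagging itself is not reproduced here.

```python
# Minimal sketch (not the study's own setup): ensemble learning (RF100)
# vs. random under-sampling + a single base learner (Naive Bayes),
# each preceded by univariate feature ranking. All settings below are
# illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic stand-in for a high-dimensional, highly imbalanced gene expression dataset.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=50,
                           weights=[0.95, 0.05], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Option 1: ensemble learning (RF100) on the original class distribution.
rf100 = Pipeline([
    ("rank", SelectKBest(f_classif, k=50)),       # feature ranker + subset size
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Option 2: data sampling (random under-sampling) + a single base learner (NB).
rus_nb = Pipeline([
    ("rank", SelectKBest(f_classif, k=50)),
    ("rus", RandomUnderSampler(random_state=42)),  # balance classes in the training folds
    ("clf", GaussianNB()),
])

for name, model in [("RF100 (ensemble)", rf100), ("RUS + NB (sampling)", rus_nb)]:
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```

Area under the ROC curve is used as the evaluation metric here because overall accuracy is misleading under severe class imbalance.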
