New approach for imbalanced biological dataset classification

This paper presents a new ensemble classifier for the class imbalance problem, with an emphasis on two-class (binary) classification. The method combines SMOTE (Synthetic Minority Over-sampling Technique), Rotation Forest, and the AdaBoostM1 algorithm. SMOTE was employed to over-sample the minority class at 100%, 200%, 300%, 400%, and 500% of its initial size, and attribute selection was conducted to prevent the classifier from over-fitting. The ensemble was designed to address the classification of imbalanced biological datasets by achieving a low prediction error and raising prediction performance: the Rotation Forest algorithm produces an ensemble classifier with a lower prediction error, while AdaBoostM1 further enhances the performance of that classifier. All tests were carried out on the training datasets using the Java-based WEKA (Waikato Environment for Knowledge Analysis) and Orange Canvas data mining systems, and the performances of three types of classifiers on imbalanced biomedical datasets were assessed. The paper examines how effectively the new method produces an accurate overall classifier and lowers the error rate of its overall performance. Tests were carried out on three real imbalanced biomedical datasets obtained from the KEEL dataset repository; these datasets were divided into ten categories according to their imbalance ratios (IR), which ranged from 1.86 to 41.40. The results indicated that the proposed method, which combines the three techniques and was assessed with various evaluation metrics, was effective. In practical terms, using SMOTE-RotBoost to classify biological datasets yields a low mean absolute error as well as high accuracy and precision. The Kappa coefficient values were close to 1, indicating near-complete agreement between the predicted and actual classes, while false negative rates close to 0 confirmed the reliability of the measurements. SMOTE-RotBoost also produces useful AUC-ROC outputs, covering a wider area under the curve than the other classifiers, which makes it a valuable method for the assessment of diagnostic tests.
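
To make the described pipeline concrete, the following is a minimal sketch of how a SMOTE + Rotation Forest + AdaBoostM1 combination can be assembled with the WEKA Java API, assuming a recent WEKA release with the SMOTE and RotationForest packages installed. The dataset file name, the 200% oversampling percentage, and the iteration counts are illustrative placeholders, not the exact configuration reported in the paper.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.meta.RotationForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.instance.SMOTE;

public class SmoteRotBoostSketch {

    public static void main(String[] args) throws Exception {
        // Load an imbalanced two-class dataset (hypothetical file name);
        // the class attribute is assumed to be the last one.
        Instances data = DataSource.read("yeast3.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // SMOTE filter: over-sample the minority class by a chosen percentage
        // (the paper evaluates 100%-500%; 200% is used here only as an example).
        SMOTE smote = new SMOTE();
        smote.setPercentage(200.0);

        // Rotation Forest as the base ensemble (its default base learner is J48).
        RotationForest rotationForest = new RotationForest();
        rotationForest.setNumIterations(10);

        // AdaBoostM1 boosts the Rotation Forest ensemble.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(rotationForest);
        boost.setNumIterations(10);

        // FilteredClassifier applies SMOTE to each training fold only,
        // so synthetic minority samples never leak into the test folds.
        FilteredClassifier smoteRotBoost = new FilteredClassifier();
        smoteRotBoost.setFilter(smote);
        smoteRotBoost.setClassifier(boost);

        // 10-fold cross-validation, reporting the metrics discussed in the paper:
        // accuracy, mean absolute error, Kappa, and area under the ROC curve.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smoteRotBoost, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy (%%): %.4f%n", eval.pctCorrect());
        System.out.printf("MAE:          %.4f%n", eval.meanAbsoluteError());
        System.out.printf("Kappa:        %.4f%n", eval.kappa());
        System.out.printf("AUC-ROC:      %.4f%n", eval.weightedAreaUnderROC());
    }
}
```

Wrapping SMOTE in a FilteredClassifier is one reasonable way to realise the combination, since it restricts oversampling to the training portion of each cross-validation fold and keeps the evaluation on unmodified test instances.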
