Diversified ensemble classifiers for highly imbalanced data learning and its application in bioinformatics

With rapid developments in science and technology, massive data sets are being generated at an exponential rate. In recent years, many supervised classification methods have shown good performance on balanced data; learning from imbalanced data, however, is still a relatively new and challenging long-term research area. In this dissertation, we study how to build an efficient ensemble classifier that learns from imbalanced data sets. We propose a formal definition of the imbalanced binary classification problem and discuss several challenging aspects of learning from imbalanced data. We survey current research trends in imbalanced learning to provide a comprehensive overview of representative studies in this area. The main contribution of this work is a new ensemble framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), which builds on the advantages of several existing ensemble strategies for imbalanced learning. The framework combines three popular learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by oppositional data re-labeling. As a meta-learner, DECIDL can use general supervised learning algorithms, such as support vector machines, decision trees, and neural networks, as the base learner to build an effective ensemble committee. We develop a comprehensive benchmark pool of 30 public imbalanced data sets with diverse data characteristics from multiple real applications. All the data sets are highly skewed, with imbalance ratios ranging from 10:1 to 100:1, and have not previously been studied completely and systematically in any work. We compare the DECIDL framework with several existing ensemble learning frameworks, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost, on this benchmark pool.
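To make one of the baseline frameworks named above concrete, the following is a minimal sketch of under-bagging, not the dissertation's implementation. It assumes one common variant (each committee member sees all minority examples plus an equally sized random subset of the majority class), assumes labels 1 for the minority class and 0 for the majority class, and uses a hand-written decision stump as the base learner. The names `fit_stump` and `under_bagging` are hypothetical.

```python
import random


def fit_stump(X, y):
    """Fit a decision stump (one feature, one threshold) by exhaustive search."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [1 if polarity * (row[j] - threshold) >= 0 else 0
                         for row in X]
                errors = sum(p != t for p, t in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, polarity)
    _, j, threshold, polarity = best
    return lambda row: 1 if polarity * (row[j] - threshold) >= 0 else 0


def under_bagging(X, y, n_estimators=11, seed=0):
    """Train a committee of stumps, each on a balanced subsample.

    Assumption: label 1 is the minority class, label 0 the majority class,
    and the majority class is at least as large as the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    committee = []
    for _ in range(n_estimators):
        # Keep every minority example; draw an equal number of majority
        # examples without replacement.
        sample = minority + rng.sample(majority, len(minority))
        committee.append(fit_stump([X[i] for i in sample],
                                   [y[i] for i in sample]))

    def predict(row):
        # Simple majority vote over the committee.
        votes = sum(member(row) for member in committee)
        return 1 if 2 * votes > len(committee) else 0

    return predict


if __name__ == "__main__":
    # Toy 1-D data with a 20:1 imbalance ratio.
    X = [[i / 100] for i in range(100)] + [[2.0 + i / 10] for i in range(5)]
    y = [0] * 100 + [1] * 5
    predict = under_bagging(X, y)
    print(predict([2.2]), predict([0.5]))
```

Because every member is trained on a balanced subsample, no single stump is dominated by the majority class; the vote then aggregates members that saw different majority subsets.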
Extensive experiments suggest that the DECIDL framework is comparable with the other ensemble methods in terms of averaged F-measure and Matthews correlation coefficient (MCC) on the 30 data sets with four base learners (decision stumps, decision trees, linear support vector machines, and perceptron neural networks). The data sets, experiments, and results provide a complete and valuable knowledge base for future research on highly imbalanced data learning. Additional experiments verify the effectiveness of DECIDL under various technical settings.

INDEX WORDS: Machine learning, Classification, Imbalanced data learning, Diversified ensemble, Bioinformatics, Protein methylation
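The two evaluation measures named in the abstract, F-measure and MCC, can be computed directly from a binary confusion matrix. The sketch below uses the standard textbook definitions and is not taken from the dissertation; the function names and the example counts are illustrative assumptions. It also shows why plain accuracy is a poor yardstick at a 100:1 imbalance ratio.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure of the positive (minority) class; beta=1 gives F1."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 1000 negatives, 10 positives (100:1).
    # A classifier that labels everything negative looks accurate
    # but has no skill on the minority class.
    print(accuracy(0, 1000, 0, 10), f_measure(0, 0, 10), mcc(0, 1000, 0, 10))
    # A classifier that finds half of the positives.
    print(accuracy(5, 995, 5, 5), f_measure(5, 5, 5), mcc(5, 995, 5, 5))
```

In the first case accuracy is about 0.99 while both F-measure and MCC are 0, which is why measures centered on the minority class are preferred for highly skewed data.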
