A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across four genomics datasets and find that the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how these two disparate methods both strike a balance between ensemble diversity and performance.
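To make two of the methods named above concrete, the following is a minimal sketch of simple aggregation (averaging base-classifier scores) and of greedy ensemble selection with replacement in the general style of Caruana et al. It is not the authors' implementation: the function names, the toy predictions, and the use of the Brier score as the selection criterion are all assumptions made for illustration, and the abstract does not state which performance measure the paper optimizes.

```python
from typing import List, Sequence


def mean_aggregate(preds: Sequence[Sequence[float]]) -> List[float]:
    """Simple aggregation: average the base classifiers' scores per example."""
    n_models = len(preds)
    return [sum(column) / n_models for column in zip(*preds)]


def brier(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def ensemble_selection(preds: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       max_rounds: int = 10) -> List[int]:
    """Greedy forward selection with replacement (illustrative sketch).

    At each round, add the model whose inclusion most reduces the Brier
    score of the averaged ensemble on validation data; stop when no
    candidate improves the score or max_rounds is reached.
    """
    chosen: List[int] = []
    running = [0.0] * len(labels)      # sum of scores of the chosen models
    best_loss = float("inf")
    for _ in range(max_rounds):
        best_idx, best_candidate_loss = None, best_loss
        size = len(chosen) + 1
        for i, p in enumerate(preds):
            candidate = [(r + s) / size for r, s in zip(running, p)]
            loss = brier(candidate, labels)
            # Small tolerance so ties do not count as improvements.
            if loss < best_candidate_loss - 1e-12:
                best_idx, best_candidate_loss = i, loss
        if best_idx is None:
            break
        chosen.append(best_idx)
        running = [r + s for r, s in zip(running, preds[best_idx])]
        best_loss = best_candidate_loss
    return chosen


# Hypothetical validation-set predictions from three base classifiers.
labels = [1, 0, 1, 0]
preds = [
    [0.9, 0.2, 0.6, 0.4],   # model 0: reasonably accurate
    [0.6, 0.4, 0.9, 0.1],   # model 1: accurate, errs on different examples
    [0.3, 0.7, 0.4, 0.6],   # model 2: poor
]

chosen = ensemble_selection(preds, labels)
selected_scores = mean_aggregate([preds[i] for i in chosen])
all_scores = mean_aggregate(preds)
print(chosen, brier(selected_scores, labels), brier(all_scores, labels))
```

On this toy data the greedy procedure keeps the two accurate models whose errors fall on different examples and leaves out the poor one, which is one way to see the balance between diversity and performance that the abstract refers to. In practice the selection would be run on held-out predictions to limit overfitting.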
