Paper information - A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across four genomics datasets and find that the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.
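To make two of the methods named in the abstract concrete, the following is a minimal Python sketch of simple aggregation (averaging base-classifier scores) and greedy forward ensemble selection with replacement, in the spirit of Caruana et al. It is not the authors' implementation. The toy scores, the classifier names (`clf_a`, `clf_b`, `clf_c`), the helper names, and the use of accuracy at a 0.5 threshold as the validation metric are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded score matches the label (illustrative metric)."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def mean_aggregate(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Simple aggregation: average the scores of all given classifiers per example."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


def greedy_ensemble_selection(
    library: Dict[str, List[float]], labels: Sequence[int], max_size: int = 5
) -> List[str]:
    """Greedy forward selection with replacement.

    At each step, add the classifier whose inclusion gives the largest strict
    improvement in validation accuracy of the averaged ensemble; stop when no
    candidate improves the score or max_size is reached.
    """
    selected: List[str] = []
    best_score = float("-inf")
    for _ in range(max_size):
        best_name = None
        for name in sorted(library):  # sorted for deterministic tie-breaking
            trial = mean_aggregate([library[m] for m in selected + [name]])
            score = accuracy(trial, labels)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break
        selected.append(best_name)
    return selected


# Hypothetical validation-set scores from three base classifiers.
labels = [1, 0, 1, 0, 1, 0]
library = {
    "clf_a": [0.9, 0.2, 0.4, 0.3, 0.8, 0.6],
    "clf_b": [0.6, 0.4, 0.7, 0.6, 0.3, 0.3],
    "clf_c": [0.4, 0.6, 0.45, 0.55, 0.6, 0.4],
}

chosen = greedy_ensemble_selection(library, labels)
ensemble_scores = mean_aggregate([library[name] for name in chosen])
print(chosen, accuracy(ensemble_scores, labels))
```

In this toy setting the two classifiers that make errors on different examples are selected together, which illustrates the balance between diversity and performance that the abstract refers to. A real application would select on held-out data and use a domain-appropriate metric.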

Gaurav Pandey | Sean Whalen
