Predicting protein function and other biomedical characteristics with heterogeneous ensembles.

Prediction problems in the biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This difficulty stems in part from incomplete knowledge of the cellular phenomena of interest, from limitations in the appropriateness and quality of the variables and measurements used for prediction, and from a lack of consensus on the ideal predictor for a given problem. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the outputs of diverse individual predictors capturing complementary aspects of the problem and/or the data. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) a better balance of diversity and performance, (ii) more effective calibration of outputs, and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large, complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its scalability on the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
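To make the two ensemble strategies named above concrete, the following is a minimal sketch, not the paper's DataSink implementation, of a heterogeneous ensemble built from diverse base classifiers. It shows (a) stacking, where a meta-learner is trained on the cross-validated outputs of the base predictors, and (b) a simple greedy ensemble-selection loop in the spirit of Caruana et al. All dataset, model, and variable names here are illustrative assumptions, and scikit-learn is used only as a convenient stand-in.

```python
# Minimal sketch of a heterogeneous ensemble: stacking and greedy ensemble
# selection over diverse base classifiers. Illustrative only; not DataSink.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for a biomedical prediction dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base predictors capturing complementary views of the data.
base = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("nb", GaussianNB()),
]

# (a) Stacking: a logistic-regression meta-learner is fit on the
# cross-validated class probabilities produced by the base predictors.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(),
                           stack_method="predict_proba", cv=5)
stack.fit(X_train, y_train)
print("stacking AUC:",
      roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))

# (b) Greedy ensemble selection: repeatedly add (with replacement) the base
# predictor whose inclusion most improves the averaged validation AUC.
val_preds = {name: cross_val_predict(clf, X_train, y_train, cv=5,
                                     method="predict_proba")[:, 1]
             for name, clf in base}
selected, ensemble = [], np.zeros(len(y_train))
for _ in range(10):
    def auc_with(name):
        candidate = (ensemble * len(selected) + val_preds[name]) / (len(selected) + 1)
        return roc_auc_score(y_train, candidate)
    best = max(val_preds, key=auc_with)
    selected.append(best)
    ensemble = (ensemble * (len(selected) - 1) + val_preds[best]) / len(selected)
print("selected models:", selected)
```

The selection loop averages the validation-set probabilities of the chosen models, so a well-performing predictor can be picked multiple times, which is one simple way the performance/diversity trade-off discussed in the abstract plays out in practice.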
