Multiple-Swarm Ensembles: Improving the Predictive Power and Robustness of Predictive Models and Its Use in Computational Biology

Machine learning is an integral part of computational biology and has already proven useful in applications such as prognostic tests. In recent years, ensembling techniques have shown their power outside biology in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work, we endeavor to show how ensembling techniques can be applied to practical problems, including problems in bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop an ensembling methodology, Multi-Swarm Ensemble (MSWE), that uses multiple particle swarm optimizations, and demonstrate its ability to further enhance the performance of ensembles.
