Effect of simple ensemble methods on protein secondary structure prediction

Ensemble methods for building improved classifier models have been an important topic in machine learning, pattern recognition and data mining, where they have shown great promise. Their robustness has driven their application to many practical classification problems, especially when there is significant diversity among the ensemble members. They have replaced traditional machine learning techniques in many applications, and special attention has been devoted to them as a means of improving prediction accuracy on problems of high complexity. Several combination rules have been investigated in this context. However, it is claimed that no single rule is always better than the others for reaching an optimal decision. The present study evaluates the performance of two different ensemble methods for protein secondary structure prediction. We focus on weighted opinion pooling and the most common aggregation rules for decision inference. The ensemble members are accurate single-model protein secondary structure predictors, namely Multi-Class Support Vector Machines and Artificial Neural Networks. Experiments are carried out using cross-validation tests on the RS126 and CB513 benchmark datasets. Our results clearly confirm that ensembles are more accurate than a single model. The experimental comparison of the investigated ensemble schemes demonstrates that the newly introduced rule, called Exponential Opinion Pool, competes well with state-of-the-art fixed rules, especially the sum rule, which in some cases achieves better performance.
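To make the idea of fixed aggregation rules and weighted opinion pooling concrete, the following is a minimal sketch, not the authors' implementation. It assumes each ensemble member outputs class posterior estimates over three secondary structure states (helix H, strand E, coil C). The function names, the example posteriors and the weights are hypothetical. The Exponential Opinion Pool is not defined in this abstract, so it is not reproduced here; only the standard sum rule, product rule and a weighted linear opinion pool are shown.

```python
from math import prod

# Three-state secondary structure alphabet (assumed for illustration).
CLASSES = ("H", "E", "C")


def sum_rule(posteriors):
    """Average the members' posterior estimates class by class."""
    n = len(posteriors)
    return [sum(p[k] for p in posteriors) / n for k in range(len(CLASSES))]


def product_rule(posteriors):
    """Multiply the members' estimates class by class, then renormalise."""
    raw = [prod(p[k] for p in posteriors) for k in range(len(CLASSES))]
    total = sum(raw)
    if total == 0:
        # Every class was vetoed by some member; fall back to uniform.
        return [1.0 / len(CLASSES)] * len(CLASSES)
    return [r / total for r in raw]


def linear_opinion_pool(posteriors, weights):
    """Weighted average of the estimates; weights are normalised to sum to 1."""
    total_w = sum(weights)
    return [
        sum(w * p[k] for w, p in zip(weights, posteriors)) / total_w
        for k in range(len(CLASSES))
    ]


def decide(scores):
    """Return the label of the class with the highest combined score."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return CLASSES[best]


# Hypothetical posteriors for one residue from two members.
svm_posterior = [0.50, 0.30, 0.20]
ann_posterior = [0.20, 0.45, 0.35]
members = [svm_posterior, ann_posterior]

print(decide(sum_rule(members)))
print(decide(product_rule(members)))
print(decide(linear_opinion_pool(members, [0.8, 0.2])))
```

With these made-up numbers the unweighted rules and the weighted pool can disagree, which is the kind of behaviour an experimental comparison of combination rules is meant to expose.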
