Correct machine learning on protein sequences: a peer-reviewing perspective

Machine learning methods are increasingly used to predict protein features from sequence. Machine learning in bioinformatics can be powerful, but it also carries the risk of introducing unexpected biases, which may lead to an overestimation of performance. This article presents a set of guidelines to help both peer reviewers and authors avoid common machine learning pitfalls. Understanding the biology is necessary to produce useful data sets, which have to be large and diverse. Separating the training and test processes is imperative to avoid overselling method performance, which also depends on several hidden parameters. A novel predictor must always be compared with several existing methods, including simple baseline strategies. Following the presented guidelines will help nonspecialists appreciate the critical issues in machine learning.
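As an illustration of two of these points (keeping training and test data separate, and comparing against a simple baseline), here is a minimal sketch in Python. It is not taken from the article. It assumes that each sequence has already been assigned to a cluster of similar sequences by some redundancy-reduction step. The input mapping `cluster_of` and the function names `cluster_aware_split` and `majority_baseline_accuracy` are hypothetical, introduced only for this example.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_of, test_fraction=0.2, seed=0):
    """Split sequence IDs so that no cluster is shared by train and test.

    cluster_of: dict mapping sequence ID -> cluster ID (hypothetical input,
    assumed to come from a prior sequence-clustering step).
    """
    # Group sequence IDs by their cluster.
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)

    # Shuffle whole clusters, not individual sequences.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_total = len(cluster_of)
    train, test = [], []
    for cid in cluster_ids:
        # Fill the test set cluster by cluster until it reaches the
        # requested size; every remaining cluster goes to training.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(clusters[cid])
    return train, test


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(label == majority for label in test_labels)
    return hits / len(test_labels)
```

Because whole clusters are assigned to one side of the split, similar sequences cannot appear in both sets, which is one way to limit the performance overestimation described above. The majority-class baseline gives a floor that any new predictor should clearly exceed on the same test set.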
