Boosting the accuracy of protein secondary structure prediction through nearest neighbor search and method hybridization

Abstract

Motivation: Protein secondary structure prediction is a fundamental precursor to many bioinformatics tasks. Nearly all state-of-the-art tools do not explicitly leverage the vast number of proteins whose structure is known when computing their predictions. Leveraging this additional information in a so-called template-based method has the potential to significantly boost prediction accuracy.

Method: We present a new hybrid approach to secondary structure prediction that gains the advantages of both template- and non-template-based methods. Our core template-based method uses metric-space nearest neighbor search over a template database of fixed-length amino acid words to estimate class-membership probabilities for each residue in the protein. These probabilities are then input to a dynamic programming algorithm that finds a physically valid maximum-likelihood prediction for the entire protein. Our hybrid approach exploits a novel accuracy estimator for our core method, which estimates the unknown true accuracy of its prediction, to decide when to switch between template- and non-template-based methods.

Results: On challenging CASP benchmarks, the resulting hybrid approach boosts the state-of-the-art Q8 accuracy by more than 2–10%, and Q3 accuracy by more than 1–3%, yielding the most accurate method currently available for both 3- and 8-state secondary structure prediction.

Availability and implementation: A preliminary implementation in a new tool we call Nnessy is available free for non-commercial use at http://nnessy.cs.arizona.edu.
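The pipeline described in the abstract can be illustrated with a toy sketch in Python. It is not the Nnessy implementation, and every concrete choice in it is an assumption made for illustration: Hamming distance stands in for the metric, brute-force sorting stands in for metric-space nearest neighbor search, and the minimum run lengths, pseudocount, accuracy estimator, threshold, and `fallback` predictor are all hypothetical.

```python
import math
from collections import Counter

STATES = "HEC"                      # 3-state alphabet: helix, strand, coil
MIN_RUN = {"H": 3, "E": 2, "C": 1}  # assumed minimum run lengths ("physically valid")

def hamming(a, b):
    """Stand-in metric on equal-length words (an assumption, not the paper's metric)."""
    return sum(x != y for x, y in zip(a, b))

def class_probabilities(word, templates, k=3, pseudocount=0.5):
    """Estimate state probabilities for a word from its k nearest template words.
    templates is a list of (word, state) pairs; search is brute force for clarity."""
    nearest = sorted(templates, key=lambda t: hamming(word, t[0]))[:k]
    counts = Counter(state for _, state in nearest)
    total = len(nearest) + pseudocount * len(STATES)
    return {s: (counts[s] + pseudocount) / total for s in STATES}

def best_valid_labeling(probs):
    """Maximum-likelihood labeling in which every run of state s has length >= MIN_RUN[s]."""
    n = len(probs)
    # prefix[s][i] = sum of log-probabilities of state s over residues 0..i-1
    prefix = {s: [0.0] for s in STATES}
    for p in probs:
        for s in STATES:
            prefix[s].append(prefix[s][-1] + math.log(p[s]))
    neg_inf = float("-inf")
    # best[i][s] = best score for residues 0..i-1 whose final run has state s
    best = [{s: neg_inf for s in STATES} for _ in range(n + 1)]
    back = [{s: None for s in STATES} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for s in STATES:
            for j in range(0, i - MIN_RUN[s] + 1):  # final run covers residues j..i-1
                seg = prefix[s][i] - prefix[s][j]
                if j == 0:
                    cand, prev = seg, None
                else:
                    # the preceding run must use a different state
                    prev = max((t for t in STATES if t != s), key=lambda t: best[j][t])
                    cand = best[j][prev] + seg
                if cand > best[i][s]:
                    best[i][s], back[i][s] = cand, (j, prev)
    if n == 0:
        return ""
    s = max(STATES, key=lambda t: best[n][t])
    runs, i = [], n
    while i > 0:  # trace the runs back from the end of the protein
        j, prev = back[i][s]
        runs.append(s * (i - j))
        i, s = j, prev
    return "".join(reversed(runs))

def estimated_accuracy(probs, labels):
    """Hypothetical estimator: mean probability assigned to the predicted states."""
    return sum(p[s] for p, s in zip(probs, labels)) / max(len(labels), 1)

def hybrid_predict(sequence, templates, fallback, w=3, threshold=0.6):
    """Use the template-based prediction when its estimated accuracy is high enough,
    otherwise defer to a non-template predictor supplied by the caller."""
    half = w // 2
    padded = "-" * half + sequence + "-" * half
    probs = [class_probabilities(padded[i:i + w], templates)
             for i in range(len(sequence))]
    labels = best_valid_labeling(probs)
    if estimated_accuracy(probs, labels) >= threshold:
        return labels
    return fallback(sequence)
```

In this sketch the dynamic program works over runs rather than single residues, which is one simple way to rule out labelings such as a one-residue helix. The switch in `hybrid_predict` only illustrates the idea of gating on an accuracy estimate; the estimator in the paper is described as novel and is not reproduced here.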
