Improving the protein fold recognition accuracy of a reduced state-space hidden Markov model

Fold recognition is a challenging problem closely tied to protein function determination, which is crucial for biologists and the pharmaceutical industry. Hidden Markov models (HMMs) have been widely used for this purpose. In this paper we demonstrate how the fold recognition performance of a recently introduced HMM with a reduced state-space topology can be improved. Our method employs an efficient architecture and a low-complexity training algorithm based on likelihood maximization. The fold recognition performance of the model is improved in two steps. In the first step, we use a smaller model architecture based on the {E,H,L} alphabet instead of the DSSP secondary structure alphabet. In the second step, secondary structure information (predicted or true) is additionally used in scoring the test set sequences. The Protein Data Bank and the annotation of the SCOP database are used for training and evaluating the proposed methodology. The results show that fold recognition accuracy improves substantially in both steps. Specifically, it increases by 2.9% to 22% in the first step. In the second step it rises further, reaching up to 30% when predicted secondary structure information is additionally used and up to 34.7% when the true secondary structure is used. The major advantage of the proposed improvements is that fold recognition performance increases substantially while the size of the model and the computational complexity of scoring decrease.
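To make the first step concrete, the following is a minimal sketch of how a DSSP secondary structure string could be collapsed to the {E,H,L} alphabet. The mapping shown is one commonly used reduction and is an assumption here; the abstract does not state which reduction the authors apply. The names `DSSP_TO_EHL` and `reduce_dssp` are hypothetical.

```python
# Assumed 8-to-3 state reduction (a common convention, not confirmed by the paper):
#   helices (H, G, I)         -> H
#   strands and bridges (E, B) -> E
#   everything else            -> L (loop/coil)
DSSP_TO_EHL = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "L", "S": "L", "-": "L", " ": "L",
}


def reduce_dssp(dssp: str) -> str:
    """Collapse a DSSP state string to the three-letter {E,H,L} alphabet.

    Any symbol not listed in DSSP_TO_EHL is treated as loop (L).
    """
    return "".join(DSSP_TO_EHL.get(state, "L") for state in dssp)


if __name__ == "__main__":
    print(reduce_dssp("HHGG-EEB TS"))
```

A smaller structural alphabet means fewer distinct states to model, which is consistent with the abstract's claim that the model size decreases in this step.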
