Bayesian Protein Secondary Structure Prediction With Near-Optimal Segmentations

Secondary structure prediction is an invaluable tool in determining the 3-D structure and function of proteins. Typically, protein secondary structure prediction methods suffer from low accuracy in beta-strand predictions, where nonlocal interactions play a significant role. There is a considerable need to model such long-range interactions, which contribute to the stabilization of a protein molecule. In this paper, we introduce an alternative decoding technique for the hidden semi-Markov model (HSMM) originally employed in the BSPSS algorithm and further developed in the IPSSP algorithm. The proposed method is based on the N-best paradigm, in which a set of most likely segmentations is computed. To generate suboptimal segmentations (i.e., alternative prediction sequences), we developed two N-best search algorithms. The first is an A* stack decoder algorithm that extends paths (or hypotheses) by one symbol at each iteration. The second locally keeps the end positions of the K highest-scoring previous segments and performs backtracking. Both algorithms employ the hidden semi-Markov model described in Aydin et al. [5] and use Viterbi scoring to compute the N-best list. The availability of near-optimal segmentations and the use of Viterbi scoring enable the sequences to be rescored using more complex dependency models that characterize nonlocal interactions in beta-sheets. After the score update, one can either keep the segmentations to be employed in 3-D structure prediction or predict the secondary structure by applying a weighted voting procedure to a set of the top-scoring M ≥ 1 segmentations. When used to predict the secondary structure, the N-best method is shown to achieve accuracy measures comparable to or better than those of the classical Viterbi decoder (MAP estimator), tested under the single-sequence condition.
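The weighted voting step over the top-scoring M segmentations can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so the softmax weighting of log scores used here is an assumption, as are the function name `weighted_vote` and the H/E/L label alphabet; this is not the authors' implementation.

```python
import math
from collections import defaultdict


def weighted_vote(segmentations, log_scores):
    """Combine equal-length label strings by per-residue weighted voting.

    segmentations: list of M strings over a label alphabet (H/E/L assumed).
    log_scores:    list of M log scores, one per segmentation.
    """
    if not segmentations or len(segmentations) != len(log_scores):
        raise ValueError("need one log score per segmentation")
    length = len(segmentations[0])
    if any(len(seg) != length for seg in segmentations):
        raise ValueError("all segmentations must have the same length")

    # Assumed weighting: softmax of the log scores. Subtracting the
    # maximum first avoids numerical underflow in exp().
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)

    consensus = []
    for i in range(length):
        votes = defaultdict(float)
        for seg, weight in zip(segmentations, weights):
            votes[seg[i]] += weight / total
        # Iterate labels in sorted order so ties resolve deterministically.
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)


if __name__ == "__main__":
    candidates = ["HHHLLEEE", "HHLLLEEE", "HHHLLLEE"]
    scores = [-10.0, -10.5, -11.0]
    print(weighted_vote(candidates, scores))
```

With M = 1 the function returns the single top-scoring segmentation unchanged, which matches the option of keeping the best rescored segmentation as the prediction.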
When no rescoring is applied, the stack decoder algorithm with sufficiently large M improves the overall sensitivity measure (Q3) of the Viterbi algorithm by 1.1%. At the same M value, the N-best Viterbi algorithm improves the Q3 measure by 0.25%, as well as the sensitivity measures specific to each secondary structure type (Qobs alpha, Qobs beta, Qobs L). When the sequences are rescored using the posterior probability distribution computed by the posterior decoding algorithm (MPM estimator), N-best Viterbi improves the Q3 measure of the Viterbi algorithm by 2.6%. The rescored N-best list approach also enables us to generate suboptimal segmentations that are valid sequences (i.e., realizable from the hidden semi-Markov model). Although the N-best algorithms and the score update procedure brought significant improvements over the Viterbi algorithm, they were not able to outperform the posterior decoding algorithm in the single-sequence condition. Further improvements in prediction accuracy should be possible with the incorporation of sophisticated models for nonlocal interactions and other physical constraints that stabilize the overall structure of a protein.
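For readers unfamiliar with the accuracy measures quoted above, the following sketch computes them under their commonly used definitions: Q3 as the percentage of residues whose predicted state matches the observed state, and Qobs for a given state as the percentage of observed residues of that state that are predicted correctly. The function names and the H/E/L labels are illustrative assumptions, not taken from the paper.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted label equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def q_obs(predicted, observed, label):
    """Percentage of observed `label` residues that are predicted as `label`.

    Returns None when the observed sequence contains no such residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must be of equal length")
    total = sum(o == label for o in observed)
    if total == 0:
        return None
    correct = sum(p == o == label for p, o in zip(predicted, observed))
    return 100.0 * correct / total


if __name__ == "__main__":
    observed = "HHHHLLEEEE"
    predicted = "HHHLLLEEEL"
    print(q3(predicted, observed))
    for state in "HEL":
        print(state, q_obs(predicted, observed, state))
```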

[1]  F. Jelinek Fast sequential decoding algorithm using a stack , 1969 .

[2]  Simon Cawley,et al.  HMM sampling and applications to gene finding and alternative splicing , 2003, ECCB.

[3]  M J Sternberg,et al.  A simple method to generate non-trivial alternate alignments of protein sequences. , 1991, Journal of molecular biology.

[4]  Silvio C. E. Tosatto,et al.  MANIFOLD: protein fold recognition based on secondary structure, sequence similarity and enzyme classification. , 2003, Protein engineering.

[5]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[6]  M. Waterman,et al.  A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. , 1987, Journal of molecular biology.

[7]  Yücel Altunbasak,et al.  Protein secondary structure prediction with semi Markov HMMs , 2004, The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[8]  V. Thorsson,et al.  HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. , 2000, Journal of molecular biology.

[9]  Pierre Baldi,et al.  Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms , 2005, ISMB.

[10]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[11]  Gianluca Pollastri,et al.  Combining protein secondary structure prediction models with ensemble methods of optimal complexity , 2004, Neurocomputing.

[12]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[13]  Pierre Baldi,et al.  Three-stage prediction of protein β-sheets by neural networks, alignments and graph algorithms , 2005 .

[14]  Ronald M. Levy,et al.  Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases , 2000, Bioinform..

[15]  F. Young Biochemistry , 1955, The Indian Medical Gazette.

[16]  Giovanni Soda,et al.  Exploiting the past and the future in protein secondary structure prediction , 1999, Bioinform..

[17]  Silvio C. E. Tosatto,et al.  The SSEA server for protein secondary structure alignment , 2005, Bioinform..

[18]  María S. Pérez-Hernández,et al.  Bayesian network multi-classifiers for protein secondary structure prediction , 2004, Artif. Intell. Medicine.

[19]  Guy M. McKhann,et al.  Biochemistry. 3rd edition , 1988, The Yale Journal of Biology and Medicine.

[20]  Wei Chu,et al.  Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction , 2006, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[21]  Piero Fariselli,et al.  A new decoding algorithm for hidden Markov models improves the prediction of the topology of all-beta membrane proteins , 2005, BMC Bioinformatics.

[22]  Temple F. Smith,et al.  Protein fold recognition by total alignment probability , 2000, Proteins.

[23]  Yücel Altunbasak,et al.  Protein secondary structure prediction for a single-sequence using hidden semi-Markov models , 2006, BMC Bioinformatics.

[24]  Jean Garnier,et al.  FORESST: fold recognition from secondary structure predictions of proteins , 1999, Bioinform..

[25]  M. Sternberg,et al.  Enhanced genome annotation using structural profiles in the program 3D-PSSM. , 2000, Journal of molecular biology.

[26]  Richard A Friesner,et al.  A novel fold recognition method using composite predicted secondary structures , 2002, Proteins.

[27]  Jacob Goldberger,et al.  Sequentially finding the N-Best List in Hidden Markov Models , 2001, IJCAI.

[28]  Burkhard Rost,et al.  Rising Accuracy of Protein Secondary Structure Prediction , 2003 .

[29]  Douglas L. Brutlag,et al.  Bayesian Segmentation of Protein Secondary Structure , 2000, J. Comput. Biol..

[30]  Nils J. Nilsson,et al.  Problem-solving methods in artificial intelligence , 1971, McGraw-Hill computer science series.

[31]  Wynne Hsu,et al.  Remote homolog detection using local sequence–structure correlations , 2004, Proteins.

[32]  Stavros J. Hamodrakas,et al.  A Hidden Markov Model method, capable of predicting and discriminating β-barrel outer membrane proteins , 2004, BMC Bioinformatics.

[33]  B. Rost,et al.  Redefining the goals of protein secondary structure prediction. , 1994, Journal of molecular biology.

[34]  Douglas B. Paul An Efficient A* Stack Decoder Algorithm for Continuous Speech Recognition with a Stochastic Language Model , 1992, HLT.

[35]  Frank K. Soong,et al.  A Tree-Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition , 1990, HLT.

[36]  B. Rost,et al.  A modified definition of Sov, a segment‐based measure for protein secondary structure prediction assessment , 1999, Proteins.

[37]  P. Argos,et al.  Seventy‐five percent accuracy in protein secondary structure prediction , 1997, Proteins.

[38]  L. Mirny,et al.  Protein structure prediction by threading. Why it works and why it does not. , 1998, Journal of molecular biology.

[39]  R. Schwartz,et al.  A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[40]  Lalit R. Bahl,et al.  A tree search strategy for large-vocabulary continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[41]  Douglas L. Brutlag,et al.  Bayesian Protein Structure Prediction , 2002 .

[42]  Anders Krogh,et al.  Two Methods for Improving Performance of a HMM and their Application for Gene Finding , 1997, ISMB.

[43]  R. Schwartz,et al.  The N-best algorithms: an efficient and exact procedure for finding the N most likely sentence hypotheses , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[44]  Wei Chu,et al.  A graphical model for protein secondary structure prediction , 2004, ICML.

[45]  Kai Wang,et al.  FSSA: a novel method for identifying functional signatures from structural alignments , 2005, Bioinform..

[46]  G J Barton,et al.  Evaluation and improvement of multiple sequence methods for protein secondary structure prediction , 1999, Proteins.

[47]  P. Argos,et al.  Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. , 1996, Protein engineering.

[48]  W. Miller,et al.  A time-efficient, linear-space local similarity algorithm , 1991 .

[49]  Richard Bonneau,et al.  Distributions of beta sheets in proteins with application to structure prediction , 2002, Proteins.