Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction

In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction that incorporates multiple sequence alignment profiles to improve predictive performance. The segmental model is a generalization of the hidden Markov model in which a hidden state generates segments of varying length and secondary structure type. We propose a novel parameterized model for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles yields substantial improvements and that the generalization performance is promising. By incorporating information from long-range interactions in beta-sheets, the model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server for our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html
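To make the segmental idea concrete, the following is a minimal sketch of maximum a posteriori segmentation under a toy semi-Markov model. It is not the likelihood proposed in the paper: the three-letter alphabet, the `EMIT` table, the uniform length prior capped at `MAX_LEN`, the uniform transitions, and the function names `log_segment` and `segment` are all assumptions made for illustration. Residues within a segment are treated as independent here, whereas the paper's model uses alignment profiles.

```python
import math

STATES = ("H", "E", "L")  # helix, strand, loop
MAX_LEN = 6               # assumed cap on segment length (hypothetical)

# Toy emission probabilities, invented for illustration only.
EMIT = {
    "H": {"A": 0.8, "V": 0.1, "G": 0.1},
    "E": {"A": 0.1, "V": 0.8, "G": 0.1},
    "L": {"A": 0.1, "V": 0.1, "G": 0.8},
}


def log_segment(state, residues):
    """Log-likelihood of one segment: uniform length prior plus
    independent per-residue emissions (a simplifying assumption)."""
    log_len = -math.log(MAX_LEN)
    return log_len + sum(math.log(EMIT[state][r]) for r in residues)


def segment(seq):
    """Return the best segmentation as (start, end, state) triples,
    with `end` exclusive, via a semi-Markov Viterbi recursion."""
    n = len(seq)
    log_init = -math.log(len(STATES))        # uniform first segment type
    log_trans = -math.log(len(STATES) - 1)   # uniform switch to another type
    # best[j][s]: best log score for seq[:j] whose last segment has type s
    best = [dict.fromkeys(STATES, -math.inf) for _ in range(n + 1)]
    back = [dict.fromkeys(STATES) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for state in STATES:
            for length in range(1, min(MAX_LEN, end) + 1):
                start = end - length
                seg = log_segment(state, seq[start:end])
                if start == 0:
                    candidates = [(log_init + seg, None)]
                else:
                    # Adjacent segments must differ in type.
                    candidates = [
                        (best[start][p] + log_trans + seg, p)
                        for p in STATES if p != state
                    ]
                for score, prev in candidates:
                    if score > best[end][state]:
                        best[end][state] = score
                        back[end][state] = (start, prev)
    # Trace back from the best final segment type.
    state = max(STATES, key=lambda s: best[n][s])
    end = n
    segments = []
    while end > 0:
        start, prev = back[end][state]
        segments.append((start, end, state))
        end, state = start, prev
    return segments[::-1]


print(segment("AAAAVVVVGG"))
# → [(0, 4, 'H'), (4, 8, 'E'), (8, 10, 'L')]
```

Unlike a standard HMM, where the hidden state is chosen per residue, the recursion here scores whole segments, so a length prior and segment-level likelihood enter directly. The cost is an extra loop over candidate segment lengths at each position.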
