Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks

Hidden Markov models (HMMs) have been successfully applied to transmembrane protein topology prediction and signal peptide prediction. In this paper, we extend this work by using the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in interpreting the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal peptide, one or more transmembrane segments, or both, and it serves as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
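
The idea of a two-stage decoder can be illustrated with a minimal sketch on a toy three-state HMM. This is not the Philius DBN or its published decoder: the state set (inside, membrane, outside), the two-symbol alphabet, all probabilities, and the function names are assumptions made only for illustration. Stage 1 computes per-residue posterior marginals with the forward-backward algorithm; stage 2 runs a Viterbi-style pass over those posteriors that only follows transitions the model's grammar allows, so the output labeling is always a legal topology.

```python
import math

# Toy model (all values are illustrative assumptions, not Philius parameters).
# States: 0 = inside loop, 1 = membrane, 2 = outside loop.
# Symbols: 0 = polar residue, 1 = hydrophobic residue.
INIT = [0.5, 0.0, 0.5]                      # a protein cannot start in the membrane
TRANS = [[0.80, 0.20, 0.00],                # inside -> outside is forbidden
         [0.15, 0.70, 0.15],
         [0.00, 0.20, 0.80]]                # outside -> inside is forbidden
EMIT = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]


def posterior_marginals(obs, init, trans, emit):
    """Stage 1: forward-backward, returning P(state_t = s | obs) per position."""
    n, k = len(obs), len(init)
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]
    fwd[0] = [init[s] * emit[s][obs[0]] for s in range(k)]
    norm = sum(fwd[0])
    fwd[0] = [v / norm for v in fwd[0]]     # rescale to avoid underflow
    for t in range(1, n):
        for s in range(k):
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(k))
        norm = sum(fwd[t])
        fwd[t] = [v / norm for v in fwd[t]]
    for t in range(n - 2, -1, -1):
        for r in range(k):
            bwd[t][r] = sum(trans[r][s] * emit[s][obs[t + 1]] * bwd[t + 1][s]
                            for s in range(k))
        norm = sum(bwd[t])
        bwd[t] = [v / norm for v in bwd[t]]
    post = []
    for t in range(n):
        row = [fwd[t][s] * bwd[t][s] for s in range(k)]
        z = sum(row)
        post.append([v / z for v in row])   # per-position normalization
    return post


def constrained_decode(post, init, trans):
    """Stage 2: best path under summed log-posteriors, legal transitions only."""
    n, k = len(post), len(init)
    neg_inf = float("-inf")

    def log(p):
        return math.log(p) if p > 0 else neg_inf

    score = [[neg_inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for s in range(k):
        if init[s] > 0:                     # only legal start states
            score[0][s] = log(post[0][s])
    for t in range(1, n):
        for s in range(k):
            best_r, best = 0, neg_inf
            for r in range(k):
                if trans[r][s] > 0 and score[t - 1][r] > best:
                    best_r, best = r, score[t - 1][r]
            score[t][s] = best + log(post[t][s])
            back[t][s] = best_r
    state = max(range(k), key=lambda s: score[n - 1][s])
    path = [state]
    for t in range(n - 1, 0, -1):           # trace back the winning path
        state = back[t][state]
        path.append(state)
    return path[::-1]


obs = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
post = posterior_marginals(obs, INIT, TRANS, EMIT)
path = constrained_decode(post, INIT, TRANS)
print(path)
```

Taking the per-position argmax of `post` directly would maximize expected per-residue accuracy but could produce an illegal labeling, such as an inside loop adjacent to an outside loop. Restricting the second pass to transitions with nonzero probability is one simple way to keep the posterior information while enforcing the grammar; the values in `post` along the chosen path could also serve as a rough per-residue confidence.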
