Transmembrane helix prediction using amino acid property features and latent semantic analysis

Background

Prediction of transmembrane (TM) helices by statistical methods suffers from a lack of sufficient training data. Current best methods use hundreds or even thousands of free parameters in their models, which are tuned to fit the little data available for training. Further, they are often restricted to the generally accepted topology "cytoplasmic-transmembrane-extracellular" and cannot adapt to membrane proteins that do not conform to this topology. Recent crystal structures of channel proteins have revealed novel architectures, showing that the above topology may not be as universal as previously believed. Thus, there is a need for methods that can better predict TM helices even in novel topologies and families.

Results

Here, we describe a new method, "TMpro", to predict TM helices with high accuracy. To avoid overfitting to existing topologies, we have collapsed the cytoplasmic and extracellular labels into a single state, non-TM. TMpro is a binary classifier that predicts TM or non-TM using multiple amino acid properties (charge, polarity, aromaticity, size and electronic properties) as features. The features are extracted from sequence information by applying the framework used for latent semantic analysis of text documents, and are input to neural networks that learn the distinction between TM and non-TM segments. The model uses only 25 free parameters. In benchmark analysis, TMpro achieves a 95% segment F-score, corresponding to a 50% reduction in error rate compared to the best methods that do not require an evolutionary profile of the protein to be known. Performance is also improved when the method is applied to the more recent and larger high-resolution datasets PDBTM and MPtopo. TMpro predictions in membrane proteins with unusual or disputed TM structure (K+ channel, aquaporin and HIV envelope glycoprotein) are discussed.

Conclusion

TMpro uses very few free parameters in modeling TM segments, as opposed to the very large number of free parameters used in state-of-the-art membrane prediction methods, yet achieves very high segment accuracies. This is highly advantageous considering that high-resolution transmembrane information is available for only very few proteins. The greatest impact of TMpro is therefore expected in the prediction of TM segments in proteins with novel topologies. Further, the paper introduces a novel method of extracting features from protein sequence, namely the latent semantic analysis model. The success of this approach in the current context suggests that it may find applications in other sequence-based analysis problems.

Availability

http://linzer.blm.cs.cmu.edu/tmpro/ and http://flan.blm.cs.cmu.edu/tmpro/
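
The feature-extraction idea described in the Results (treating amino acid properties as "words" and sequence windows as "documents", as in latent semantic analysis) can be illustrated with a minimal sketch of the first step only: building a property-by-window count matrix. Everything specific here is an assumption made for illustration and is not taken from the paper: the property groupings in `CHARGE` and `AROMATIC`, the restriction to two of the five properties, the default window length of 15, and the function names `residue_symbols` and `window_count_matrix`.

```python
from collections import Counter

# Illustrative property vocabularies (assumed groupings, not the paper's).
CHARGE = {"K": "pos", "R": "pos", "D": "neg", "E": "neg"}  # others: neutral
AROMATIC = {"F", "W", "Y"}                                 # others: non-aromatic

# Row order of the count matrix: one row per property symbol ("word").
VOCAB = ["charge:pos", "charge:neg", "charge:neutral", "arom:yes", "arom:no"]


def residue_symbols(aa):
    """Return the property symbols ("words") emitted by one amino acid."""
    charge = CHARGE.get(aa, "neutral")
    arom = "yes" if aa in AROMATIC else "no"
    return ["charge:" + charge, "arom:" + arom]


def window_count_matrix(sequence, window=15, step=1):
    """Build a property-by-window count matrix.

    Rows follow VOCAB; each column is one sliding window (a "document").
    The window length is a hypothetical default chosen for this sketch.
    """
    columns = []
    for start in range(0, len(sequence) - window + 1, step):
        counts = Counter()
        for aa in sequence[start:start + window]:
            counts.update(residue_symbols(aa))
        columns.append([counts[sym] for sym in VOCAB])
    # Transpose so that rows are property symbols and columns are windows.
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Toy sequence with three overlapping windows of length 3.
    for sym, row in zip(VOCAB, window_count_matrix("KLFDE", window=3)):
        print(sym, row)
```

In a full latent semantic analysis pipeline, a matrix like this would typically be reduced with a truncated singular value decomposition (for example via `numpy.linalg.svd`), and the low-dimensional window vectors would then be passed to a classifier. Those later steps are omitted here because the abstract does not specify their details.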

[1]  G. Heijne,et al.  Genome‐wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms , 1998, Protein science : a publication of the Protein Society.

[2]  J.R. Bellegarda,et al.  Exploiting latent semantic information in statistical language modeling , 2000, Proceedings of the IEEE.

[3]  S H White,et al.  Energetics, stability, and prediction of transmembrane helices. , 2001, Journal of molecular biology.

[4]  N. Balakrishnan,et al.  Characterization of protein secondary structure , 2004, IEEE Signal Processing Magazine.

[5]  G. Tusnády,et al.  Principles governing amino acid composition of integral membrane proteins: application to topology prediction. , 1998, Journal of molecular biology.

[6]  M Kanehisa,et al.  Prediction of membrane proteins based on classification of transmembrane segments. , 1998, Protein engineering.

[7]  Aleksey A. Porollo,et al.  Enhanced recognition of protein transmembrane domains with prediction-based structural profiles , 2006, Bioinform..

[8]  Kevin Murphy,et al.  Bayes net toolbox for Matlab , 1999 .

[9]  A. Kernytsky,et al.  Transmembrane helix predictions revisited , 2002, Protein science : a publication of the Protein Society.

[10]  S H White,et al.  MPtopo: A database of membrane protein topology , 2001, Protein science : a publication of the Protein Society.

[11]  A. Elofsson,et al.  Best α‐helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information , 2004 .

[12]  Erik L. L. Sonnhammer,et al.  A Hidden Markov Model for Predicting Transmembrane Helices in Protein Sequences , 1998, ISMB.

[13]  Birgit Eisenhaber,et al.  TM or not TM: transmembrane protein prediction with low false positive rate using DAS-TMfilter , 2004, Bioinform..

[14]  Masami Ikeda,et al.  Transmembrane topology prediction methods: A re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topology , 2001, Silico Biol..

[15]  Herbert R. Treutlein,et al.  Simulation of helix association in membranes: modeling the glycophorin A transmembrane domain , 1993, [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences.

[16]  V. Lingappa,et al.  Integral membrane protein biosynthesis: why topology is hard to predict. , 2002, Journal of cell science.

[17]  Peter W. Foltz,et al.  An introduction to latent semantic analysis , 1998 .

[18]  Y. Sugiyama,et al.  Identification of transmembrane protein functions by binary topology patterns. , 2003, Protein engineering.

[19]  Marialuisa Pellegrini-Calace,et al.  Towards genome-scale structure prediction for transmembrane proteins , 2006, Philosophical Transactions of the Royal Society B: Biological Sciences.

[20]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[21]  G V Nikiforovich,et al.  Isolated transmembrane helices arranged across a membrane: computational studies. , 1999, Protein engineering.

[22]  D. Doyle,et al.  Transmembrane helix prediction: a comparative evaluation and analysis. , 2005, Protein engineering, design & selection : PEDS.

[23]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[24]  Burkhard Rost,et al.  Long membrane helices and short loops predicted less accurately , 2002, Protein science : a publication of the Protein Society.

[25]  Zsuzsanna Dosztányi,et al.  PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank , 2004, Nucleic Acids Res..

[26]  Sarel J Fleishman,et al.  Transmembrane protein structures without X-rays. , 2006, Trends in biochemical sciences.

[27]  B. Rost,et al.  Topology prediction for helical transmembrane proteins at 86% accuracy–Topology prediction at 86% accuracy , 1996, Protein science : a publication of the Protein Society.

[28]  Burkhard Rost,et al.  Static benchmarking of membrane helix predictions , 2003, Nucleic Acids Res..

[29]  N. Ben-Tal,et al.  kPROT: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. , 1999, Journal of molecular biology.

[30]  B. Rost,et al.  Transmembrane helices predicted at 95% accuracy , 1995, Protein science : a publication of the Protein Society.

[31]  Shigeki Mitaku,et al.  SOSUI: classification and secondary structure prediction system for membrane proteins , 1998, Bioinform..

[32]  Juan Jesús Pérez,et al.  BUNDLE: A program for building the transmembrane domains of G-protein-coupled receptors , 1998, J. Comput. Aided Mol. Des..

[33]  B. Chait,et al.  The structure of the potassium channel: molecular basis of K+ conduction and selectivity. , 1998, Science.

[34]  N. Dimmock,et al.  The C-terminal tail of the gp41 transmembrane envelope glycoprotein of HIV-1 clades A, B, C, and D may exist in two conformations: an analysis of sequence, structure, and function , 2005, Virology.

[35]  Andreas Engel,et al.  Structural determinants of water permeation through aquaporin-1 , 2000, Nature.

[36]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation (3rd Edition) , 2007 .

[37]  J. Nazuno Haykin, Simon. Neural networks: A comprehensive foundation, Prentice Hall, Inc. Segunda Edición, 1999 , 2000 .

[38]  A Elofsson,et al.  Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. , 1997, Protein engineering.

[39]  S H White,et al.  Global statistics of protein sequences: implications for the origin, evolution, and prediction of structure. , 1994, Annual review of biophysics and biomolecular structure.

[40]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation , 1998 .

[41]  A. Krogh,et al.  Reliability measures for membrane protein topology prediction algorithms. , 2003, Journal of molecular biology.

[42]  Janet M Thornton,et al.  Computational analysis of alpha-helical membrane protein structure: implications for the prediction of 3D structural models. , 2004, Protein engineering, design & selection : PEDS.

[43]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[44]  Václav Hlaváč,et al.  Statistical Pattern Recognition Toolbox for Matlab User's guide , 2004 .

[45]  A. Krogh,et al.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. , 2001, Journal of molecular biology.