Prediction of protein interdomain linker regions by a hidden Markov model

MOTIVATION: Our aim was to predict protein interdomain linker regions from sequence alone, without requiring known homology. Identifying linker regions delineates domain boundaries and can be used to computationally dissect proteins into domains before clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We estimated the model with an efficient Bayesian procedure based on Markov chain Monte Carlo, specifically Gibbs sampling, to simulate parameters from their posterior distributions. Our model treats the sequence data as continuous rather than categorical and generates a probabilistic output.

RESULTS: We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to those of a simpler method that also uses the linker index.
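To make the idea of a two-state linker/non-linker hidden Markov model with continuous observations and a probabilistic output more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian emissions for a per-residue linker index and computes posterior state probabilities with the standard scaled forward-backward recursions. It does not perform the Gibbs sampling described in the abstract: all parameter values (means, standard deviations, transition and initial probabilities) are fixed by hand, and the function names, the toy signal, and its sign convention are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_linker_prob(values, means, sds, trans, init):
    """Posterior P(state 1 | all observations) at each position.

    State 0 stands for non-linker and state 1 for linker (an assumed
    labelling). Uses the scaled forward-backward algorithm for an HMM
    with Gaussian emissions.
    """
    n = len(values)
    k = len(means)
    emis = [[gaussian_pdf(x, means[s], sds[s]) for s in range(k)] for x in values]

    # Forward pass, rescaling each step so the probabilities sum to 1.
    alpha = []
    scale = []
    cur = [init[s] * emis[0][s] for s in range(k)]
    c = sum(cur)
    alpha.append([p / c for p in cur])
    scale.append(c)
    for t in range(1, n):
        cur = [
            emis[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(k))
            for j in range(k)
        ]
        c = sum(cur)
        alpha.append([p / c for p in cur])
        scale.append(c)

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            beta[t][i] = sum(
                trans[i][j] * emis[t + 1][j] * beta[t + 1][j] for j in range(k)
            ) / scale[t + 1]

    # Combine both passes into per-position posteriors for state 1.
    post = []
    for t in range(n):
        gamma = [alpha[t][s] * beta[t][s] for s in range(k)]
        post.append(gamma[1] / sum(gamma))
    return post


# Toy linker-index signal: low values flank a stretch of high values.
toy_index = [0.0] * 5 + [1.0] * 5 + [0.0] * 5
probs = posterior_linker_prob(
    toy_index,
    means=[0.0, 1.0],                 # assumed emission means
    sds=[0.3, 0.3],                   # assumed emission standard deviations
    trans=[[0.9, 0.1], [0.1, 0.9]],   # assumed transition probabilities
    init=[0.5, 0.5],                  # assumed initial state probabilities
)
print([round(p, 3) for p in probs])
```

In a Bayesian treatment such as the one the abstract describes, the hand-set parameters above would instead be drawn repeatedly from their posterior distributions, and the per-position probabilities would be summarized over those draws.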
