Prediction of protein domain boundaries from inverse covariances

It has been known since only a few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and act as reusable functional modules found in many contexts. In many structural biology tasks, particularly structure prediction, it is very useful to identify the domains within a structure and analyze these regions separately. From sequence data alone, however, this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on the size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information, including a kernel smoothing-based approach and methods based on building alpha-carbon models, and compare their performance with a length-based predictor, a homology search method, and four published sequence-based predictors: DOMCUT, DomPRO, DLP-SVM, and SCOOBY-DOmain. We show that the kernel-smoothing method is significantly better than the other ab initio predictors when both single-domain and multidomain targets are considered, and is not significantly different from the homology-based method. Considering only multidomain targets, the kernel-smoothing method outperforms all of the published methods except DLP-SVM. The kernel-smoothing method therefore represents a potentially useful improvement to ab initio domain prediction. Proteins 2013. © 2012 Wiley Periodicals, Inc.
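To make the idea of using predicted contacts for boundary detection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that contacts arrive as hypothetical `(i, j, score)` tuples with 0-based residue indices, that a Gaussian kernel with an arbitrary bandwidth is an acceptable smoother, and that a single boundary is placed where the smoothed weight of contacts crossing a position is lowest. The function names, the bandwidth, and the minimum domain size are all illustrative choices.

```python
import math


def crossing_profile(length, contacts, bandwidth=5.0):
    """Gaussian-smoothed total score of contacts that span each position.

    contacts: iterable of (i, j, score) with 0-based residue indices
    (an assumed input format, not one defined by the paper).
    """
    raw = [0.0] * length
    for i, j, score in contacts:
        lo, hi = min(i, j), max(i, j)
        # A cut placed just before residue k separates lo from hi
        # whenever lo < k <= hi.
        for k in range(lo + 1, hi + 1):
            raw[k] += score

    smooth = []
    for k in range(length):
        num = den = 0.0
        for m in range(length):
            w = math.exp(-0.5 * ((k - m) / bandwidth) ** 2)
            num += w * raw[m]
            den += w
        smooth.append(num / den)  # normalised kernel-weighted average
    return smooth


def predict_boundary(length, contacts, min_domain=40, bandwidth=5.0):
    """Return the index of the first residue of the second domain,
    or None if the chain is too short to hold two domains."""
    profile = crossing_profile(length, contacts, bandwidth)
    candidates = range(min_domain, length - min_domain + 1)
    if len(candidates) == 0:
        return None
    return min(candidates, key=lambda k: profile[k])


# Toy example: two 60-residue domains with contacts only inside each domain.
toy = [(i, i + 10, 1.0) for i in range(0, 50)]
toy += [(i, i + 10, 1.0) for i in range(60, 110)]
print(predict_boundary(120, toy))  # prints 60
```

In this toy case no contact crosses position 60, so the smoothed profile has its minimum there. Real predicted contact lists are noisy and may contain several boundaries, so a practical predictor would also need a rule for deciding whether any boundary should be called at all.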
