Identification of putative domain linkers by a neural network – application to a large sequence database

Background: The reliable dissection of large proteins into structural domains is an important issue for structural genomics/proteomics projects. To provide a practical approach to this issue, we tested the ability of a neural network to identify domain linkers in the SWISSPROT database (101602 sequences).

Results: Our search detected 3009 putative domain linkers adjacent to or overlapping with domains, as defined by sequence similarity to either Protein Data Bank (PDB) or Conserved Domain Database (CDD) sequences. Among these putative linkers, 75% were "correctly" located within 20 residues of a domain terminus; the remaining 25% were found in the middle of a domain and probably represent failed predictions. Moreover, our neural network predicted 5124 putative domain linkers in structurally unannotated regions without sequence similarity to PDB or CDD sequences, which suggests the possible existence of novel structural domains. As a comparison, we performed the same analysis by identifying low-complexity regions (LCRs), which are known to encode unstructured polypeptide segments, and observed that the fraction of LCRs that correlate with domain termini is similar to that of domain linkers. However, domain linkers and LCRs appeared to identify different types of domain boundary regions, as only 32% of the putative domain linkers overlapped with LCRs.

Conclusion: Overall, our study indicates that the two methods detect independent and complementary regions, and that combining them can substantially improve the sensitivity of domain boundary prediction. This finding should enable the identification of novel structural domains, yielding new targets for large-scale protein analyses.
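The evaluation criterion described in the abstract (a predicted linker counts as "correct" when it lies within 20 residues of a domain terminus, and as a likely failure when it falls in the middle of a domain) can be illustrated with a minimal sketch. This is not the authors' code: the interval representation, the distance definition, and the names `classify_linker`, `overlaps_any`, and `window` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumption: segments are inclusive, 1-based (start, end) residue positions.
Interval = Tuple[int, int]


def distance_to_position(segment: Interval, pos: int) -> int:
    """Residue distance from a segment to a position (0 if pos lies inside it)."""
    start, end = segment
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))


def overlaps_any(segment: Interval, others: List[Interval]) -> bool:
    """True if the segment shares at least one residue with any other segment."""
    return any(segment[0] <= e and segment[1] >= s for s, e in others)


def classify_linker(linker: Interval, domains: List[Interval], window: int = 20) -> str:
    """Label a predicted linker relative to annotated domains.

    'near_terminus': within `window` residues of some domain start or end
    'mid_domain'   : overlaps a domain but is far from both of its termini
    'unannotated'  : neither of the above
    """
    inside_domain = False
    for d_start, d_end in domains:
        nearest = min(distance_to_position(linker, d_start),
                      distance_to_position(linker, d_end))
        if nearest <= window:
            return "near_terminus"
        if linker[0] <= d_end and linker[1] >= d_start:
            inside_domain = True
    return "mid_domain" if inside_domain else "unannotated"


# Hypothetical example: one annotated domain spanning residues 50-200.
domains = [(50, 200)]
labels = [classify_linker(l, domains) for l in [(205, 215), (110, 125), (400, 410)]]
```

The same `overlaps_any` helper could be applied to predicted linkers and LCR segments to estimate how often the two methods flag the same region, which is the kind of comparison behind the 32% overlap figure.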
