Loop‐length‐dependent SVM prediction of domain linkers for high‐throughput structural proteomics

Predicting structural domains in novel protein sequences is becoming practically important. One important application is the development of low‐cost, computer‐aided techniques for identifying novel protein domain targets for large‐scale functional and structural proteomics. Here, we report a loop‐length‐dependent support vector machine (SVM) prediction of domain linkers, which are loops separating two structural domains. (DLP‐SVM is freely available at: http://www.tuat.ac.jp/~domserv/cgi-bin/DLP-SVM.cgi.) We constructed three loop‐length‐dependent SVM predictors of domain linkers (SVM‐All, SVM‐Long and SVM‐Short), and also built SVM‐Joint, which combines the results of SVM‐Short and SVM‐Long into a single consolidated prediction. SVM‐Joint performed best in most respects, with a sensitivity of 59.7% and a specificity of 43.6%, indicating that taking loop‐length‐dependent characteristics into account improved the specificity and the sensitivity by over 2 and 3%, respectively. Furthermore, the sensitivity and specificity of SVM‐Joint were, respectively, 37.6 and 17.4% higher than those of a random guess, and also superior to those of previously reported domain linker predictors. These results indicate that SVMs can be used to predict domain linkers, and that loop‐length‐dependent characteristics are useful for improving SVM prediction performance. © 2008 Wiley Periodicals, Inc. Biopolymers (Pept Sci) 92: 1–8, 2009.
