Improving the performance of DomainDiscovery of protein domain boundary assignment using inter-domain linker index

Background: Knowledge of protein domain boundaries is critical for the characterisation and understanding of protein function. The ability to identify domains from sequence information alone, without knowledge of the structure, is an essential step in many types of protein analysis. In the present study, we demonstrate that the performance of DomainDiscovery improves significantly when the inter-domain linker index value is included for sequence-based domain identification. Improved DomainDiscovery uses a Support Vector Machine (SVM) approach and a unique training dataset built on the principle of consensus among experts in defining domains in protein structure. The SVM was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and the inter-domain linker index to detect possible domain boundaries for a target sequence.

Results: Improved DomainDiscovery is compared with other methods by benchmarking against a structurally non-redundant dataset and against CASP5 targets. Improved DomainDiscovery achieves 70% accuracy for domain boundary identification in multi-domain proteins.

Conclusion: Improved DomainDiscovery compares favourably with other methods and excels in the identification of domain boundaries for multi-domain proteins as a result of introducing a support vector machine with the benchmark_2 dataset.
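The abstract names four per-residue inputs to the SVM (PSSM, secondary structure, solvent accessibility and inter-domain linker index) but not how they are encoded. As a minimal sketch only, the following shows one common way such inputs could be combined into a fixed-length vector per residue using a sliding window. The function name `window_features`, the window size, the three-state secondary structure alphabet and the zero-padding at the sequence ends are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def window_features(
    pssm: Sequence[Sequence[float]],
    sec_struct: str,
    solvent_acc: Sequence[float],
    linker_index: Sequence[float],
    half_window: int = 2,  # assumed value; the paper's window size is not stated here
) -> List[List[float]]:
    """Build one feature vector per residue from a window centred on it.

    Hypothetical encoding: each window position contributes its PSSM row,
    a one-hot secondary structure state, its solvent accessibility and its
    linker index. Positions outside the sequence are zero-padded.
    """
    n = len(pssm)
    if not (len(sec_struct) == len(solvent_acc) == len(linker_index) == n):
        raise ValueError("all per-residue inputs must have the same length")

    ss_states = "HEC"  # assumed alphabet: helix, strand, coil
    pssm_width = len(pssm[0]) if n else 0
    per_residue = pssm_width + len(ss_states) + 2

    vectors: List[List[float]] = []
    for centre in range(n):
        vec: List[float] = []
        for pos in range(centre - half_window, centre + half_window + 1):
            if 0 <= pos < n:
                vec.extend(pssm[pos])
                vec.extend(1.0 if sec_struct[pos] == s else 0.0 for s in ss_states)
                vec.append(solvent_acc[pos])
                vec.append(linker_index[pos])
            else:
                # Window extends past a sequence end: pad with zeros.
                vec.extend([0.0] * per_residue)
        vectors.append(vec)
    return vectors
```

Each resulting vector, paired with a label saying whether the centre residue lies at a domain boundary, could then be passed to any SVM library for training. The choice of kernel and parameters is left open here because the abstract does not specify them.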
