DomSVR: domain boundary prediction with support vector regression from sequence information alone

Protein domains are the fundamental structural and functional units of proteins. Knowledge of protein domain boundaries helps in understanding the evolution, structure and function of proteins, and also plays an important role in protein classification. In this paper, we propose a support vector regression-based method for protein domain boundary identification that uses novel input profiles extracted from the AAindex database. Our method achieves an average sensitivity of ∼36.5% and an average specificity of ∼81% for multi-domain protein chains, which is better overall than the performance of published approaches to domain boundary identification. Because our method uses sequence information alone, it is also simpler and faster.
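As a rough illustration of what a sequence-only input profile can look like, the sketch below turns a protein sequence into one fixed-width feature vector per residue using a single amino acid index. It is a minimal sketch under stated assumptions, not the authors' pipeline: the choice of the Kyte-Doolittle hydropathy scale, the window size, the zero padding, and the function name `window_profiles` are all illustrative and are not taken from the paper.

```python
# Illustrative only: the scale, window size, and padding are assumptions,
# not details taken from the paper.

# Kyte-Doolittle hydropathy values, one of the amino acid indices
# available in AAindex.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profiles(sequence, scale=KYTE_DOOLITTLE, half_window=2, pad=0.0):
    """Return one feature vector per residue.

    Each vector holds the scale values of the residues in a window
    centred on that residue; positions past the chain ends are padded.
    """
    values = [scale[aa] for aa in sequence]
    padded = [pad] * half_window + values + [pad] * half_window
    width = 2 * half_window + 1
    return [padded[i:i + width] for i in range(len(values))]


# Hypothetical five-residue sequence, used only to show the output shape.
features = window_profiles("MKVLA")
for residue, row in zip("MKVLA", features):
    print(residue, row)
```

Each row could then be paired with a per-residue target value and passed to a regression model such as a support vector regressor; how the targets are defined and which indices are combined are choices this sketch does not attempt to reproduce.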
