Improved general regression network for protein domain boundary prediction

Background: Protein domains provide some of the most useful information for understanding protein structure and function. Recent research on protein domain boundary prediction has relied mainly on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model, the Improved General Regression Network (IGRN), that can achieve accurate and reliable classification with significantly reduced computation. The IGRN was trained using a Position Specific Scoring Matrix (PSSM), secondary structure, solvent accessibility information and an inter-domain linker index to detect possible domain boundaries for a target sequence.

Results: The proposed model achieved an average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domain proteins, and showed superior predictive performance and generalisation ability relative to the most widely used neural network models. On the CASP7 benchmark dataset, it achieved 70.10% prediction accuracy, comparable to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery.

Conclusion: The proposed model compares favourably with other existing machine learning based methods, as well as with widely known domain boundary predictors, on two benchmark datasets, and excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements.
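For orientation, the idea behind a general regression network can be sketched as a Gaussian-kernel-weighted average of training targets. The snippet below is a minimal illustration of the standard (unimproved) formulation only; it is not the IGRN described above, and the function name `grnn_predict`, the smoothing parameter `sigma`, the two-dimensional toy features and the 0.5 decision threshold are all assumptions made for illustration. In practice each residue would be described by a much larger vector built from PSSM, secondary structure, solvent accessibility and linker index values.

```python
import math


def grnn_predict(train_x, train_y, query, sigma=0.5):
    """Standard general regression network output for one query vector.

    Returns the Gaussian-kernel-weighted mean of the training targets.
    `sigma` is the kernel width (a hypothetical default, not from the paper).
    """
    weights = []
    for x in train_x:
        # Squared Euclidean distance between a training pattern and the query.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed to zero; fall back to the unweighted mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


# Toy, made-up feature vectors standing in for per-residue encodings.
train_x = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
train_y = [1.0, 1.0, 0.0, 0.0]  # 1 = domain boundary, 0 = non-boundary

score = grnn_predict(train_x, train_y, [0.15, 0.85])
is_boundary = score >= 0.5  # assumed decision threshold
```

Because the prediction is a direct weighted average over stored patterns, there is no iterative weight training; the main tunable quantity in this sketch is the kernel width `sigma`.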
