Extending Protein Domain Boundary Predictors to Detect Discontinuous Domains

Many protein domain predictors have been developed in recent years to predict domain boundaries, but most of them cannot predict discontinuous domains. Because nearly 40% of multidomain proteins contain one or more discontinuous domains, we developed DomEx, which enables domain boundary predictors to detect discontinuous domains by assembling continuous domain segments. Discontinuous domains are predicted by matching the sequence profile of concatenated continuous domain segments against profiles from a single-domain library derived from SCOP, CATH, and Pfam. The matches are then filtered by similarity to library templates, a symmetric index score, and a profile-profile alignment score. When tested on 97 non-homologous protein chains containing 58 continuous and 99 discontinuous domains, DomEx recalled 32.3% of the discontinuous domains with 86.5% precision, with predicted domain segments required to lie within ±20 residues of the boundary definitions in CATH 3.5. On a benchmark of 29 discontinuous-domain chains, DomEx recalled 26.7% of the discontinuous domains with 72.7% precision, whereas ThreaDom, our recently developed predictor and the state-of-the-art tool for detecting discontinuous domains, failed to predict any. Furthermore, when combined with ThreaDom, DomEx ranked first among 10 predictors. The source code and datasets are available at https://github.com/xuezhidong/DomEx.
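
The segment-assembly step described above can be illustrated with a minimal Python sketch. It is not the DomEx implementation: the function names (`split_by_boundaries`, `candidate_discontinuous_domains`, `concatenate`), the `max_parts` limit, and the 1-based inclusive coordinate convention are all assumptions made for illustration. The sketch only enumerates candidate combinations of non-adjacent continuous segments and concatenates their sequences. The profile matching against the single-domain library and the subsequent filtering are omitted.

```python
from itertools import combinations
from typing import List, Tuple

Segment = Tuple[int, int]  # hypothetical convention: 1-based inclusive (start, end)


def split_by_boundaries(length: int, boundaries: List[int]) -> List[Segment]:
    """Cut a chain of `length` residues into continuous segments.

    Each boundary b is taken to be the last residue of a segment,
    so the next segment starts at b + 1.
    """
    cuts = sorted(b for b in set(boundaries) if 1 <= b < length)
    starts = [1] + [b + 1 for b in cuts]
    ends = cuts + [length]
    return list(zip(starts, ends))


def candidate_discontinuous_domains(
    segments: List[Segment], max_parts: int = 2
) -> List[Tuple[int, ...]]:
    """List index tuples of segments that could form a discontinuous domain.

    A combination qualifies when at least one pair of consecutive chosen
    segments is separated by a skipped segment in the chain.
    """
    candidates = []
    for k in range(2, max_parts + 1):
        for combo in combinations(range(len(segments)), k):
            if any(b - a > 1 for a, b in zip(combo, combo[1:])):
                candidates.append(combo)
    return candidates


def concatenate(sequence: str, segments: List[Segment], combo: Tuple[int, ...]) -> str:
    """Join the residues of the chosen segments in chain order."""
    return "".join(sequence[segments[i][0] - 1 : segments[i][1]] for i in combo)


if __name__ == "__main__":
    seq = "ABCDEFGHIJ"  # toy sequence, not real protein data
    segs = split_by_boundaries(len(seq), [3, 7])
    for combo in candidate_discontinuous_domains(segs):
        print(combo, concatenate(seq, segs, combo))
```

In a full pipeline, each concatenated sequence would be turned into a sequence profile and compared with the library profiles. Only candidates that pass the filters would be reported as discontinuous domains.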
