Functional classification of CATH superfamilies: a domain-based approach for protein function annotation

Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer.

Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications, as validated using known functional information. The conserved positions predicted by the FunFams are also enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are more precise than those of other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies, which can be used to functionally annotate >16 million domain sequences.

Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam.

Contact: sayoni.das.12@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
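Illustrative sketch: the abstract refers to conserved positions predicted from FunFam alignments. The Python snippet below shows one generic way such positions could be scored, using normalised Shannon entropy per alignment column. It is not the FunFHMMer procedure, which is not specified here. The function names, the toy alignment and the 0.9 threshold are assumptions made purely for illustration.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column as 1 - (Shannon entropy / log2(20)).

    `alignment` is a list of equal-length aligned sequences; '-' marks a gap.
    A score of 1.0 means the column is fully conserved (gaps are ignored).
    This is a generic conservation measure, not the FunFHMMer scoring scheme.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # 20 standard amino acids
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no signal
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


def conserved_positions(alignment, threshold=0.9):
    """Return 0-based column indices whose score meets the (assumed) threshold."""
    return [
        i for i, score in enumerate(column_conservation(alignment))
        if score >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical toy alignment of four domain sequences.
    toy_family = ["HDKAG", "HDRSG", "HDKTG", "HEKAG"]
    print(conserved_positions(toy_family))
```

In this toy alignment only the first and last columns are invariant, so they are the only ones reported. A real analysis would need to account for sequence redundancy and alignment diversity before such scores are meaningful.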
