CATH: an expanded resource to predict protein function through structure and sequence

The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.1, http://www.cathdb.info). The resource comprises over 300 000 domain structures and over 53 million protein domains classified into 2737 homologous superfamilies, doubling the number of predicted protein domains in the previous version. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 100 000 additional protein domains. This article describes developments to the CATH-Gene3D resource over the last two years since the publication in 2015, including: significant increases to our structural and sequence coverage; expansion of the functional families in CATH; building a support vector machine (SVM) to automatically assign domains to superfamilies; improved search facilities to return alignments of query sequences against multiple sequence alignments; the redesign of the web pages and download site.

[1]  Zhengwei Zhu,et al.  CD-HIT: accelerated for clustering the next-generation sequencing data , 2012, Bioinform..

[2]  David A. Lee,et al.  Novel Computational Protocols for Functionally Classifying and Characterising Serine Beta-Lactamases , 2016, PLoS Comput. Biol..

[3]  Sarah C. Ayling,et al.  The Ensembl gene annotation system , 2016, Database J. Biol. Databases Curation.

[4]  Robert D. Finn,et al.  HMMER web server: 2015 update , 2015, Nucleic Acids Res..

[5]  W. S. Valdar,et al.  Scoring residue conservation , 2002, Proteins.

[6]  David S. Goodsell,et al.  The RCSB Protein Data Bank: views of structural biology for basic and applied research and education , 2014, Nucleic Acids Res..

[7]  David T. Jones,et al.  Improving classification in protein structure databases using text mining , 2009, BMC Bioinformatics.

[8]  Burkhard Rost,et al.  MSAViewer: interactive JavaScript visualization of multiple sequence alignments , 2016, Bioinform..

[9]  K. Katoh,et al.  MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability , 2013, Molecular biology and evolution.

[10]  David A. Lee,et al.  Functional classification of CATH superfamilies: a domain-based approach for protein function annotation , 2015, Bioinform..

[11]  David A. Lee,et al.  GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains , 2009, Nucleic acids research.

[12]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[13]  Michael Kohl,et al.  Cytoscape: software for visualization and analysis of biological networks. , 2011, Methods in molecular biology.

[14]  E. Webb Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. , 1992 .

[15]  The Uniprot Consortium,et al.  UniProt: a hub for protein information , 2014, Nucleic Acids Res..

[16]  David A. Lee,et al.  CATH: comprehensive structural and functional annotations for genome sequences , 2014, Nucleic Acids Res..

[17]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[18]  Janet M. Thornton,et al.  Large-Scale Analysis Exploring Evolution of Catalytic Machineries and Mechanisms in Enzyme Superfamilies , 2016, Journal of molecular biology.

[19]  David A. Lee,et al.  Functional classification of CATH superfamilies: a domain-based approach for protein function annotation , 2015, Bioinform..

[20]  Martin Madera,et al.  Profile Comparer: a program for scoring and aligning profile hidden Markov models , 2008, Bioinform..

[21]  Nicholas B Rego,et al.  3Dmol.js: molecular visualization with WebGL , 2014, Bioinform..

[22]  J. Skolnick,et al.  TM-align: a protein structure alignment algorithm based on the TM-score , 2005, Nucleic acids research.

[23]  David A. Lee,et al.  Gene3D: expanding the utility of domain assignments , 2015, Nucleic Acids Res..

[24]  Rachel Kolodny,et al.  Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. , 2005, Journal of molecular biology.

[25]  Tapio Salakoski,et al.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy , 2016, Genome Biology.

[26]  María Martín,et al.  UniProt: A hub for protein information , 2015 .

[27]  W R Taylor,et al.  SSAP: sequential structure alignment program for protein structure comparison. , 1996, Methods in enzymology.