Protein function annotation using protein domain family resources.

Genome sequencing and structural genomics initiatives have produced a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. Computational approaches that can predict protein function are therefore essential for bridging this widening annotation gap. This article reviews current approaches to protein function prediction that use structure- and sequence-based protein domain family resources, with a special focus on functional families in the CATH-Gene3D resource.
