Protocols to capture the functional plasticity of protein domain superfamilies

Most proteins comprise several domains, segments that are clearly discernible in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution, giving rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH, primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. The few existing resources that aim to classify domain families involve a considerable amount of manual curation. Whilst immensely valuable, manual curation is inherently slow and expensive, and can thus impede large-scale application. This work describes the development and application of a fully automatic pipeline for identifying functional families within superfamilies of protein domains. The pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments. In addition, it implements two protocols for identifying families on the basis of the clustering results: a supervised protocol, used when high-quality protein function annotation data are associated with a given superfamily, and an unsupervised protocol, used when they are not. The results attained for more than 1,500 domain superfamilies are discussed both qualitatively and quantitatively. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. Importantly, the focus lies on domain function, not whole-protein function.
