Functional Classification Using Phylogenomic Inference

Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context. As originally outlined by Jonathan Eisen [1–3], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and—finally—inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to [1]. One of us, while at Celera Genomics, separately came up with a similar approach for the functional classification of the human genome [4], based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences [5,6]. Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale [7]—and the challenges we have faced in this effort—motivate this paper. Figure 1 Phylogenomic Analysis of Protein Function Using Subfamily Annotation In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function through the use of annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error. The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [8–10]. The root causes of these errors are these: Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query. Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures. Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected. Evolutionary distance. Two proteins can share a common ancestor and domain structure, yet have very different functions simply due to their presence in very divergently related species. Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [3,7]. Why then is phylogenomic inference not used more widely? We believe this is due to four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [11,12]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer. Revolutions do not normally take place overnight. These four reasons result in phylogenomic inference being applied on a one-off basis, for a few protein superfamilies here and there. This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13–15] or of functional subfamilies [6,16–21]. Other groups have used whole-tree analyses [22–24]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families. Table 1 Resources for Phylogenomic Analysis It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.

[1]  Shoshana D. Brown,et al.  A gold standard set of mechanistically diverse enzyme superfamilies , 2006, Genome Biology.

[2]  N. Wicker,et al.  Secator: a program for inferring protein subfamilies from phylogenetic trees. , 2001, Molecular biology and evolution.

[3]  Kimmen Sjölander,et al.  Phylogenetic Inference in Protein Superfamilies: Analysis of SH2 Domains , 1998, ISMB.

[4]  Andrew Rambaut,et al.  Heterotachy and tree building: a case study with plastids and eubacteria. , 2006, Molecular biology and evolution.

[5]  Michael Y. Galperin,et al.  Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption , 1998, Silico Biol..

[6]  David Haussler,et al.  Classifying G-protein coupled receptors with support vector machines , 2002, Bioinform..

[7]  Kimmen Sjölander,et al.  SATCHMO: Sequence Alignment and Tree Construction Using Hidden Markov Models , 2003, Bioinform..

[8]  Arndt von Haeseler,et al.  Simultaneous statistical multiple alignment and phylogeny reconstruction. , 2005, Systematic biology.

[9]  I. Holmes,et al.  Using guide trees to construct multiple-sequence evolutionary HMMs , 2003, ISMB.

[10]  Alfonso Valencia,et al.  Clustering of proximal sequence space for the identification of protein families , 2002, Bioinform..

[11]  Jason Weston,et al.  Semi-supervised Protein Classification Using Cluster Kernels , 2003, NIPS.

[12]  István Miklós,et al.  Bayesian coestimation of phylogeny and sequence alignment , 2005, BMC Bioinformatics.

[13]  Benjamin A. Shoemaker,et al.  CDD: a database of conserved domain alignments with links to domain three-dimensional structure , 2002, Nucleic Acids Res..

[14]  Michael Ashburner,et al.  Assessment of genome-wide protein function classification for Drosophila melanogaster. , 2003, Genome research.

[15]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[16]  Kimmen Sjölander,et al.  Phylogenomic inference of protein molecular function: advances and challenges , 2004, Bioinform..

[17]  S. Brenner Errors in genome annotation. , 1999, Trends in genetics : TIG.

[18]  J A Eisen,et al.  Evolution of the SNF2 family of proteins: subfamilies with distinct sequences and functions. , 1995, Nucleic acids research.

[19]  Duncan P. Brown,et al.  Subfamily HMMS in Functional Genomics , 2004, Pacific Symposium on Biocomputing.

[20]  Martin Vingron,et al.  A set-theoretic approach to database searching and clustering , 1998, Bioinform..

[21]  J A Eisen,et al.  Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. , 1998, Genome research.

[22]  Sean R. Eddy,et al.  RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs , 2002, BMC Bioinformatics.

[23]  P. Babbitt Definitions of enzyme function for the structural genomics era. , 2003, Current opinion in chemical biology.

[24]  Anton J. Enright,et al.  Protein interaction maps for complete genomes based on gene fusion events , 1999, Nature.

[25]  R A Goldstein,et al.  Models of natural mutations including site heterogeneity , 1998, Proteins.

[26]  Erik L. L. Sonnhammer,et al.  Automated ortholog inference from phylogenetic trees and calculation of orthology reliability , 2002, Bioinform..

[27]  M. Ruggero,et al.  Similarity of Traveling-Wave Delays in the Hearing Organs of Humans and Other Tetrapods , 2007, Journal for the Association for Research in Otolaryngology.

[28]  Philip E. Bourne,et al.  Structural Evolution of the Protein Kinase–Like Superfamily , 2005, PLoS Comput. Biol..

[29]  Carl E. Rasmussen,et al.  Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models , 2003, Pacific Symposium on Biocomputing.

[30]  J. Huelsenbeck,et al.  Application and accuracy of molecular phylogenies. , 1994, Science.

[31]  Tim J. P. Hubbard,et al.  SCOP: a structural classification of proteins database , 1998, Nucleic Acids Res..

[32]  A. Sali,et al.  Protein Structure Prediction and Structural Genomics , 2001, Science.

[33]  D. Rebholz-Schuhmann,et al.  Facts from Text—Is Text Mining Ready to Deliver? , 2005, PLoS biology.

[34]  Michael I. Jordan,et al.  Protein Molecular Function Prediction by Bayesian Phylogenomics , 2005, PLoS Comput. Biol..

[35]  B. Rost Enzyme function less conserved than anticipated. , 2002, Journal of molecular biology.

[36]  D. Barrell,et al.  The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. , 2003, Genome research.

[37]  L. Koski,et al.  The Closest BLAST Hit Is Often Not the Nearest Neighbor , 2001, Journal of Molecular Evolution.

[38]  Adam Zemla,et al.  Critical assessment of methods of protein structure prediction (CASP)‐round V , 2005, Proteins.

[39]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[40]  H. Philippe,et al.  Heterotachy, an important process of protein evolution. , 2002, Molecular biology and evolution.

[41]  Jonathan A. Eisen,et al.  Gastrogenomic delights: A movable feast , 1997, Nature Medicine.

[42]  Michael Y. Galperin,et al.  The COG database: a tool for genome-scale analysis of protein functions and evolution , 2000, Nucleic Acids Res..

[43]  N Linial,et al.  ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space , 1999, Proteins.

[44]  P. Lio’,et al.  Models of molecular evolution and phylogeny. , 1998, Genome research.

[45]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.