Defining functional distance using manifold embeddings of gene ontology annotations

Although rigorous measures of sequence and structure similarity are now well established, defining functional relationships has remained particularly daunting. Here, we present several manifold embedding techniques for computing distances between Gene Ontology (GO) functional annotations and, from these, estimating functional distances between protein domains. To evaluate accuracy, we correlate functional distance with well-established measures of sequence, structural, and phylogenetic similarity. We then show that the manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space, and that functional distances place structure–function relationships in biological context, giving insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized to a wide array of biologically relevant investigations, such as the accuracy of annotation transfer, the relationship between sequence, structure, and function, and the coherence of expression modules.
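
The general idea of embedding annotation terms and then comparing annotated domains can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of this paper: the ontology is a hypothetical seven-term toy graph with made-up term names, the embedding is a commute-time (resistance-distance) spectral embedding of the graph Laplacian, and two sets of terms are compared with a modified Hausdorff distance. The function names embed_terms, term_distance, and domain_distance are likewise hypothetical.

```python
import numpy as np

# Hypothetical toy ontology (made-up term names, not real GO identifiers),
# treated as a connected, undirected, unweighted graph.
EDGES = [
    ("root", "binding"), ("root", "catalysis"),
    ("binding", "dna_binding"), ("binding", "rna_binding"),
    ("catalysis", "kinase"), ("catalysis", "ligase"),
]

def embed_terms(edges):
    """Commute-time embedding: Laplacian eigenvectors scaled by 1/sqrt(eigenvalue)."""
    terms = sorted({t for edge in edges for t in edge})
    index = {t: i for i, t in enumerate(terms)}
    adj = np.zeros((len(terms), len(terms)))
    for a, b in edges:
        adj[index[a], index[b]] = adj[index[b], index[a]] = 1.0
    laplacian = np.diag(adj.sum(axis=1)) - adj
    vals, vecs = np.linalg.eigh(laplacian)
    keep = vals > 1e-9  # drop the constant eigenvector (eigenvalue 0)
    coords = vecs[:, keep] / np.sqrt(vals[keep])
    return {t: coords[index[t]] for t in terms}

def term_distance(emb, a, b):
    """Euclidean distance between two embedded terms."""
    return float(np.linalg.norm(emb[a] - emb[b]))

def domain_distance(emb, terms_a, terms_b):
    """Modified Hausdorff distance between two sets of annotation terms."""
    def directed(src, dst):
        # Mean, over source terms, of the distance to the nearest target term.
        return np.mean([min(term_distance(emb, s, d) for d in dst) for s in src])
    return float(max(directed(terms_a, terms_b), directed(terms_b, terms_a)))

emb = embed_terms(EDGES)

# Hypothetical domains, each annotated with a set of terms.
dom_x = {"dna_binding", "rna_binding"}
dom_y = {"dna_binding"}
dom_z = {"kinase", "ligase"}

print("x vs y:", domain_distance(emb, dom_x, dom_y))
print("x vs z:", domain_distance(emb, dom_x, dom_z))
```

In this toy graph the two binding-annotated domains come out closer to each other than either is to the catalysis-annotated domain. Because the toy ontology is a tree with unit edge weights, the squared embedded distance between two terms equals the number of edges on the path between them, which makes the sketch easy to check by hand; a real ontology is a much larger directed acyclic graph and would need its own choice of edge weights and embedding dimension.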
