A Global Approach to Comparative Genomics: Comparison of Functional Annotation over the Taxonomic Tree

Genome sequencing projects produce large amounts of data that are stored in sequence databases. Entries in these databases are annotated using the results of different experiments and computational methods. These methods usually rely on homology detection based on sequence similarity searches. The Gene Ontology (GO) provides a standard vocabulary of functional terms and allows a coherent annotation of gene products. These annotations can serve as a basis for new methods that compare gene products by their molecular function and biological role. In this thesis, we present a new approach for integrating the species taxonomy, protein family classifications, and GO annotations. We implemented a database and a client application, GOTaxExplorer, that can be used to perform queries in a simplified language and to process and visualize the results. It allows users to compare taxonomic groups with regard to the protein families or protein functions associated with their genomes. We developed a method for comparing GO annotations that includes a measure of functional similarity between gene products. The method was able to find functional relationships even when the proteins showed no significant sequence similarity. We provide results for different application scenarios, in particular for the identification of new drug targets.

I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.

Andreas Schlicker, August 30, 2005
