Mining phenotypes for gene function prediction

Background

Health and disease of organisms are reflected in their phenotypes. Often, a genetic component of a disease is discovered only after its phenotype has been clearly defined. In recent years, many technologies for generating phenotypes systematically and in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher gene functions. However, there have been relatively few efforts to make use of phenotype data beyond single genotype-phenotype relationships.

Results

We present the results of a study in which we use a large set of phenotype data, in textual form, to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators of biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters to predict gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO annotations.

Conclusion

Because phenotypes visibly reflect genetic activity, they are useful for inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.
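The pipeline described in the Results (cluster genes by phenotype text, then carry annotations over within each cluster) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the greedy cosine-similarity grouping stands in for whatever text clustering algorithm the authors used, the thresholds are arbitrary, and the gene names, phenotype strings, and annotation terms are hypothetical toy data.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for real text preprocessing."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_genes(phenotypes, threshold=0.5):
    """Greedy grouping (assumed, not the study's algorithm): a gene joins
    the first cluster whose seed gene is at least `threshold` similar."""
    vectors = {gene: bag_of_words(text) for gene, text in phenotypes.items()}
    clusters = []  # each cluster is a list of genes; the first is the seed
    for gene in sorted(vectors):
        for members in clusters:
            if cosine(vectors[gene], vectors[members[0]]) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


def transfer_annotations(clusters, annotations, min_support=0.5):
    """For each gene, predict the terms carried by at least `min_support`
    of the OTHER annotated genes in its cluster (its own terms are held out)."""
    predictions = {}
    for members in clusters:
        for gene in members:
            others = [g for g in members if g != gene and annotations.get(g)]
            if not others:
                continue
            counts = Counter(term for g in others for term in annotations[g])
            predicted = {t for t, c in counts.items() if c / len(others) >= min_support}
            if predicted:
                predictions[gene] = predicted
    return predictions


# Hypothetical toy data, invented for this example only.
phenotypes = {
    "geneA": "reduced response to odorants in antenna",
    "geneB": "reduced response to odorants in antenna neurons",
    "geneC": "flightless with abnormal muscle structure",
    "geneD": "flightless abnormal indirect flight muscle structure",
}
annotations = {
    "geneA": {"sensory perception of smell"},
    "geneC": {"muscle organ development"},
}

clusters = cluster_genes(phenotypes)
predictions = transfer_annotations(clusters, annotations)
print(clusters)
print(predictions)
```

Holding out a gene's own annotations when predicting for it is what would allow predictions for already-annotated genes to be scored against their known terms, in the spirit of the cross-validated precision and recall reported above.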
