Phenotype Data: A Neglected Resource in Biomedical Research?

To a great extent, our phenotype is determined by our genetic material. Many genotypic modifications ultimately become manifest as more or less pronounced changes in phenotype. Despite the importance of understanding how specific genetic alterations contribute to the development of diseases, surprisingly little effort has been made to exploit systematically the current knowledge of genotype-phenotype relationships. In the past, genes were characterized with the help of so-called "forward genetics" studies in model organisms, which relate a given phenotype to a genetic modification. Analogous studies in higher organisms were hampered by the lack of suitable high-throughput genetic methods. This situation has now changed with the advent of new screening methods, especially RNA interference (RNAi), which makes it possible to silence genes specifically, one at a time, and to observe the phenotypic outcome. This ongoing large-scale characterization of genes in mammalian in vitro model systems will increase phenotypic information exponentially in the very near future. But will our knowledge grow equally fast? As in other scientific areas, data integration is a key problem. Interpreting the results of large-scale functional screens therefore remains a major bioinformatics challenge, even more so if sets of heterogeneous data are to be combined. It is now time to develop strategies to structure and use these data in order to transform the wealth of information into knowledge and, eventually, into novel therapeutic approaches. In light of these developments, we thoroughly surveyed the available phenotype resources and reviewed different approaches to analyzing their content. We discuss hurdles yet to be overcome, namely the lack of data integration, the absence of adequate phenotype ontologies and the shortage of appropriate analytical tools. This review aims to assist researchers keen to understand and make effective use of these highly valuable data.
