Integrating and Warehousing Liver Gene Expression Data and Related Biomedical Resources in GEDAW

Researchers at Inserm U522, a medical research institute specializing in the liver, use high-throughput technologies to diagnose liver disease states. They seek to identify the set of genes that are dysregulated in different physiopathological situations, along with the molecular regulation mechanisms involved in the onset of these diseases, with the medium-term goal of developing new diagnostic and therapeutic tools. Answering such a complex question requires considering both the data generated on these genes by in-house transcriptome experiments and the annotations extracted from the many publicly available, heterogeneous biomedical resources. This paper presents GEDAW, a gene expression data warehouse developed to assist such discovery processes. Its distinctive feature is that it systematically integrates gene information from multiple structured data sources. These sources include: i) GenBank XML records, used to annotate gene sequence features and integrated using a schema mapping approach; ii) an in-house relational database that stores detailed experimental data on the liver genes and serves as a permanent source of expression levels for the warehouse, without unnecessary details on the experiments; and iii) a semi-structured data source called BioMeKE-XML that provides, for each gene, its nomenclature, its functional annotation according to the Gene Ontology, and its medical annotation according to the UMLS. Because GEDAW is a liver gene expression data warehouse, we have paid particular attention to medical knowledge, so that biological mechanisms and medical knowledge can be correlated with experimental data. The paper discusses the data sources and the transformation process applied to resolve syntactic and semantic conflicts between the source formats and the GEDAW schema.
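To make the schema mapping idea concrete, the following is a minimal sketch of how an XML source record could be translated into a warehouse-side gene record. It is not the GEDAW implementation: the XML element names, the `MAPPING` table, the `map_record` function, and the accession value are all hypothetical and do not reflect the actual GenBank XML format or the GEDAW schema. It only illustrates resolving a syntactic conflict (strings converted to integers) and a naming conflict (source names renamed to target attributes) through declarative mapping rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical source record. Element and attribute names are illustrative
# assumptions, not the real GenBank XML schema.
SOURCE_XML = """
<record>
  <accession> ACC0001 </accession>
  <gene_symbol>alb</gene_symbol>
  <features>
    <feature type="CDS" start="40" end="1869"/>
    <feature type="exon" start="1" end="120"/>
  </features>
</record>
"""

# Declarative mapping rules (assumed for this sketch):
# target attribute -> (source path, conversion applied to the source text).
MAPPING = {
    "accession_id": ("accession", str.strip),
    "symbol": ("gene_symbol", lambda s: s.strip().upper()),
}


def map_record(xml_text):
    """Translate one source XML record into a flat target-schema dictionary."""
    root = ET.fromstring(xml_text)
    gene = {}
    # Apply each scalar mapping rule; a missing source element maps to None.
    for target, (path, convert) in MAPPING.items():
        node = root.find(path)
        has_text = node is not None and node.text
        gene[target] = convert(node.text) if has_text else None
    # Nested features are renamed and their coordinates cast to integers.
    gene["features"] = [
        {
            "kind": f.get("type"),
            "start": int(f.get("start")),
            "end": int(f.get("end")),
        }
        for f in root.findall("features/feature")
    ]
    return gene


gene = map_record(SOURCE_XML)
print(gene)
```

Keeping the rules in a table rather than in code means a change in the source format would, under these assumptions, only require editing `MAPPING`, not the translation logic.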
