Adapting resources to the Semantic Web: Experience with Entrez Gene

Modern biomedical research is increasingly supported by information technologies. Biologists and physicians rely not only on the biomedical literature (e.g., MEDLINE) but also on the many knowledge bases available online (e.g., through the Entrez portal of the National Center for Biotechnology Information (NCBI)). While these resources are undeniably valuable to humans, most of them are text-based and heterogeneous, and cannot easily be processed by computers. For example, the information retrieved from sources such as Entrez Gene (EG) [2] or the Online Mendelian Inheritance in Man (OMIM) [4] is represented in XML, but each source follows a different document type definition (DTD). Hence, queries across the different NCBI data sources are possible only by implementing complex linkages. Moreover, even within a single data source such as EG, the traditional relational database schema makes it extremely difficult to query for information using the relationships between concepts. The Biomedical Knowledge Repository (BKR), under development at the National Library of Medicine, addresses these limitations. It consists of an extensive collection of normalized assertions (i.e., concept-relationship-concept triples) represented in a common format that computers can process. It can therefore be understood as a specialized version of the Semantic Web. In this paper, we describe a pilot contribution to the BKR: the transformation of the EG database into the W3C Resource Description Framework (RDF) format [5].
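To make the notion of concept-relationship-concept triples concrete, the following is a minimal sketch of how one flat gene record might be turned into RDF statements serialized as N-Triples. It is an illustration under stated assumptions, not the conversion pipeline used for the BKR: the `http://example.org/eg/` namespace, the predicate names (`has_symbol`, `has_organism`, `has_omim_reference`), the record layout, and the sample values are all hypothetical placeholders.

```python
# Hypothetical namespace; the URIs actually used for the BKR are not given here.
BASE = "http://example.org/eg/"


def uri(local):
    """Wrap a local name in the (assumed) namespace as an N-Triples IRI."""
    return f"<{BASE}{local}>"


def literal(value):
    """Serialize a value as a plain N-Triples string literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_triples(record):
    """Turn one flat gene record into (subject, predicate, object) triples."""
    subject = uri(f"gene/{record['gene_id']}")
    triples = [
        (subject, uri("has_symbol"), literal(record["symbol"])),
        (subject, uri("has_organism"), literal(record["organism"])),
    ]
    # Cross-references become links to other resources rather than strings,
    # so they can be followed as relationships between concepts.
    for omim_id in record.get("omim_ids", []):
        triples.append((subject, uri("has_omim_reference"), uri(f"omim/{omim_id}")))
    return triples


def to_ntriples(triples):
    """Serialize triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def objects_of(triples, predicate_local):
    """Return every object linked through the given (assumed) predicate."""
    predicate = uri(predicate_local)
    return [o for _, p, o in triples if p == predicate]


# Made-up sample record, for illustration only.
record = {"gene_id": 1, "symbol": "GENE1", "organism": "Homo sapiens", "omim_ids": [100]}
triples = record_to_triples(record)
print(to_ntriples(triples))
```

Once every record is expressed this way, a relationship-based question (for example, which OMIM entries a gene points to) reduces to matching a predicate, as `objects_of(triples, "has_omim_reference")` does here, rather than joining tables in a relational schema.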

[1] Tatiana A. Tatusova, et al. Entrez Gene: gene-centered information at NCBI, 2004, Nucleic Acids Res.

[2] Brian McBride, et al. Jena: A Semantic Web Toolkit, 2002, IEEE Internet Comput.

[3] S. Ravada, et al. Object Type and Reification in Oracle, 2005.

[4] Nicole Alexander, et al. RDF Object Type and Reification in the Database, 2006, 22nd International Conference on Data Engineering (ICDE'06).