Emerging practices for mapping and linking life sciences data using RDF - A case series

Members of the W3C Health Care and Life Sciences Interest Group (HCLS IG) have published a variety of genomic and drug-related data sets as Resource Description Framework (RDF) triples. This experience has helped the interest group define a general data workflow for mapping health care and life science (HCLS) data to RDF and linking it with other Linked Data sources. This paper presents the workflow along with four case studies that demonstrate it and address many of the challenges that may be faced when creating new Linked Data resources. The first case study describes the creation of linked RDF data from microarray data sets, while the second discusses a linked RDF data set created from a knowledge base of drug therapies and drug targets. The third case study describes the creation of an RDF index of biomedical concepts present in unstructured clinical reports and how this index was linked to a drug side-effect knowledge base. The final case study describes the initial development of a linked data set from a knowledge base of small molecules. This paper also provides a detailed set of recommended practices for creating and publishing Linked Data sources in the HCLS domain in such a way that they are discoverable and usable by people, software agents, and applications. These practices are based on the cumulative experience of the Linked Open Drug Data (LODD) task force of the HCLS IG. While no single set of recommendations can address all of the heterogeneous information needs that exist within the HCLS domain, practitioners wishing to create Linked Data should find the recommendations useful for identifying the tools, techniques, and practices employed by earlier developers. In addition to clarifying the available methods for producing Linked Data, the metadata recommendations should make Linked Data easier to discover and consume.
