Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration?

Many science archive centres publish very large volumes of image, simulation, and experiment data. In order to integrate and analyse the available data, scientists need to be able to (i) identify and locate all the data relevant to their work; (ii) understand the multiple heterogeneous data models in which the data is published; and (iii) interpret and process the data they retrieve. RDF has been shown to be a generally successful framework within which to perform such data integration work. It can be equally successful in the context of scientific data, provided that it is demonstrably practical to expose that data as RDF. In this paper we investigate the capabilities of RDF to enable the integration of scientific data sources. Specifically, we discuss the suitability of SPARQL for expressing scientific queries, and the performance of several triple stores and RDB2RDF tools for executing queries over a moderately sized sample of a large astronomical data set. We found that further research into, and improvements to, SPARQL and RDB2RDF tools are required before existing science archives can be exposed efficiently for data integration.
