Query optimization for ontology-based information integration

In recent years, there has been an explosion of publicly available RDF and OWL data sources. To answer queries effectively and quickly in such an environment, we present an approach that identifies the potentially relevant Semantic Web data sources using query rewritings and a term index. We demonstrate that such an approach must carefully handle query subgoals that lack constants; otherwise the algorithm may identify many sources that do not contribute to the eventual answers. This is because the term index only indicates whether a URI is present in a document, and specific answers to a subgoal cannot be calculated until the source is physically accessed, an expensive operation given disk and network latency. We present an algorithm that, given a set of query rewritings that account for ontology heterogeneity, incrementally selects and processes sources in order to maintain selectivity. Once sources are selected, we use an OWL reasoner to answer queries over these sources and their corresponding ontologies. We present the results of experiments using both a synthetic data set and a subset of the real-world Billion Triple Challenge data.
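
The role of the term index in source selection can be illustrated with a minimal sketch. It is not the algorithm from the paper: the names `candidate_sources` and `order_subgoals`, the `?`-prefix convention for variables, and the toy index are all assumptions made for illustration. The sketch only shows why a subgoal with constants narrows the source set while a subgoal without constants cannot be narrowed by the index at all; it does not cover query rewriting, incremental source processing, or reasoning.

```python
from typing import Dict, List, Optional, Set, Tuple

# A subgoal is a triple pattern; terms starting with "?" are variables (assumed convention).
Subgoal = Tuple[str, str, str]
# Hypothetical term index: URI -> set of source documents that mention it.
TermIndex = Dict[str, Set[str]]


def is_var(term: str) -> bool:
    return term.startswith("?")


def candidate_sources(subgoal: Subgoal, index: TermIndex) -> Optional[Set[str]]:
    """Intersect the index entries of every constant in the subgoal.

    Returns None when the subgoal has no constants, because the index
    then gives no way to narrow down the sources.
    """
    constants = [t for t in subgoal if not is_var(t)]
    if not constants:
        return None
    sources = set(index.get(constants[0], set()))
    for uri in constants[1:]:
        sources &= index.get(uri, set())
    return sources


def order_subgoals(subgoals: List[Subgoal], index: TermIndex) -> List[Subgoal]:
    """Order subgoals so the most selective come first and constant-free ones last."""
    def key(sg: Subgoal) -> Tuple[bool, int]:
        sources = candidate_sources(sg, index)
        return (sources is None, len(sources) if sources is not None else 0)
    return sorted(subgoals, key=key)


# Toy data, invented for this example.
index: TermIndex = {
    "ex:advisor": {"d1", "d2", "d3"},
    "ex:Alice": {"d2"},
    "ex:name": {"d1", "d2", "d3", "d4"},
}
query: List[Subgoal] = [
    ("?s", "?p", "?o"),
    ("?x", "ex:name", "?n"),
    ("?x", "ex:advisor", "ex:Alice"),
]
ordered = order_subgoals(query, index)
```

In this toy setup the subgoal mentioning `ex:Alice` is ranked first because only one document contains both of its constants, while the all-variable subgoal is deferred. An incremental strategy of the kind the abstract describes could then use answers from already loaded sources to supply constants for the deferred subgoals.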
