DAW: Duplicate-AWare Federated Query Processing over the Web of Data

Over the last few years, the Web of Data has developed into a large compendium of interlinked datasets from multiple domains. Due to the decentralised architecture of this compendium, several of these datasets contain duplicated data. Yet, so far, little attention has been paid to the effect of duplicated data on federated querying. This work presents DAW, a novel duplicate-aware approach to federated querying over the Web of Data. DAW is based on a combination of min-wise independent permutations and compact data summaries. It can be directly combined with existing federated query engines to achieve the same query recall while querying fewer data sources. We extend three well-known federated query processing engines (DARQ, SPLENDID, and FedX) with DAW and compare our extensions with the original approaches. The comparison shows that DAW can greatly reduce the number of queries sent to the endpoints while maintaining high query recall. It can therefore significantly improve the performance of federated query processing engines. Moreover, DAW provides a source selection mechanism that maximises query recall when query processing is limited to a subset of the sources.
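To make the idea of duplicate-aware source selection more concrete, the following is a minimal sketch of how min-wise hashing can estimate the overlap between two sources from compact signatures, and how such estimates could be used to skip sources that are expected to contribute little new data. It is not the DAW algorithm as implemented by the authors: the function names, the seeded BLAKE2b hash standing in for a family of min-wise independent permutations, the signature length, and the greedy `select_sources` rule with its `min_new` threshold are all assumptions made for illustration.

```python
import hashlib


def _hash(value: str, seed: int) -> int:
    """Seeded 64-bit hash; stands in for one min-wise permutation (assumption)."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Compact summary of a set: the minimum hash value under each seed."""
    items = list(items)
    if not items:
        raise ValueError("cannot summarise an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def estimate_overlap(sig_a, size_a, sig_b, size_b):
    """Estimate |A ∩ B| from J = I / (|A| + |B| - I), i.e. I = J(|A| + |B|) / (1 + J)."""
    j = estimate_jaccard(sig_a, sig_b)
    return j * (size_a + size_b) / (1.0 + j)


def select_sources(summaries, min_new=1.0):
    """Hypothetical greedy rule: visit sources from largest to smallest and keep
    a source only if its estimated number of new items reaches min_new.

    summaries maps a source name to a (signature, set_size) pair.
    """
    selected = []
    union_sig, union_size = None, 0.0
    for name, (sig, size) in sorted(summaries.items(), key=lambda kv: -kv[1][1]):
        if union_sig is None:
            new = float(size)
        else:
            new = size - estimate_overlap(union_sig, union_size, sig, size)
        if new >= min_new:
            selected.append(name)
            # The signature of a union is the element-wise minimum of the signatures.
            union_sig = sig if union_sig is None else [
                min(a, b) for a, b in zip(union_sig, sig)
            ]
            union_size += new
    return selected


if __name__ == "__main__":
    # Toy "triples" represented as strings; B duplicates A, C is disjoint from both.
    a = [f"triple-{i}" for i in range(1000)]
    c = [f"triple-{i}" for i in range(5000, 6000)]
    summaries = {
        "A": (minhash_signature(a), len(a)),
        "B": (minhash_signature(a), len(a)),
        "C": (minhash_signature(c), len(c)),
    }
    print(select_sources(summaries))
```

In this toy setup the duplicate source is skipped because its signature matches the one already selected, so its estimated contribution is zero. Longer signatures give more accurate overlap estimates at the cost of larger summaries.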
