STREAK: An Efficient Engine for Processing Top-k SPARQL Queries with Spatial Filters

The importance of geo-spatial data in critical applications such as emergency response, transportation, and agriculture has prompted the adoption of the recent GeoSPARQL standard in many RDF processing engines. In addition to large repositories of geo-spatial data (e.g., LinkedGeoData and OpenStreetMap), spatial data is also routinely found in automatically constructed knowledge bases such as Yago and WikiData. While there have been research efforts on efficient processing of spatial data in RDF/SPARQL, very little effort has gone into building end-to-end systems that can holistically handle complex SPARQL queries along with spatial filters. In this paper, we present Streak, an RDF data management system designed to support a wide range of queries with spatial filters, including complex joins, top-k, and higher-order relationships over spatially enriched databases. Streak introduces several novel features: a careful identifier encoding strategy for spatial and non-spatial entities, a semantics-aware Quad-tree index that allows for early termination, and a clever use of adaptive query processing with zero plan-switch cost. We show that Streak can scale to some of the largest publicly available semantic data resources, such as Yago3 and LinkedGeoData, which contain spatial entities and quantifiable predicates useful for result ranking. For experimental evaluation, we focus on top-k distance join queries and demonstrate that Streak outperforms popular spatial join algorithms as well as state-of-the-art end-to-end systems like Virtuoso and PostgreSQL.
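
To make the notion of a top-k distance join with early termination concrete, the following is a minimal, generic sketch in Python. It is not Streak's algorithm: it uses no Quad-tree index, identifier encoding, or adaptive query processing, and simply scans two score-sorted lists with a threshold-style stopping rule. The function name `topk_distance_join`, the `(name, x, y, score)` tuple layout, the Euclidean distance filter, and the sum-of-scores ranking function are all assumptions made for illustration.

```python
import heapq
from math import hypot


def topk_distance_join(left, right, max_dist, k):
    """Return the k highest-scoring (left, right) pairs within max_dist.

    left, right: lists of (name, x, y, score) tuples (hypothetical layout).
    A pair's score is the sum of the two entity scores (assumed ranking).
    Result: list of (pair_score, left_name, right_name), best first.
    """
    if k <= 0 or not left or not right:
        return []

    # Sort both inputs by descending score so that upper bounds are cheap.
    left = sorted(left, key=lambda e: -e[3])
    right = sorted(right, key=lambda e: -e[3])
    best_right = right[0][3]

    heap = []  # min-heap holding the current top-k pairs
    for lname, lx, ly, ls in left:
        # No pair involving this or any later left entity can beat the
        # current k-th best result, so the whole join can stop early.
        if len(heap) == k and ls + best_right <= heap[0][0]:
            break
        for rname, rx, ry, rs in right:
            pair_score = ls + rs
            # Remaining right entities only score lower: stop this scan.
            if len(heap) == k and pair_score <= heap[0][0]:
                break
            # Spatial filter: keep only pairs within the distance bound.
            if hypot(lx - rx, ly - ry) <= max_dist:
                entry = (pair_score, lname, rname)
                if len(heap) < k:
                    heapq.heappush(heap, entry)
                else:
                    heapq.heapreplace(heap, entry)

    return sorted(heap, reverse=True)
```

With small inputs the function can be called as `topk_distance_join(left, right, max_dist=2, k=2)`. A spatial index would avoid the distance check against far-away entities that this sketch still performs, and pairs that tie with the current k-th best score are not guaranteed to be returned.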
