Fast SPARQL join processing between distributed streams and stored RDF graphs using Bloom filters

The growth of both real-time and stored data keeps the three V's of big data (volume, velocity and variety) a constant concern. Existing RDF Stream Processing (RSP) systems have addressed the variety challenge by defining a common model for producing, transmitting and continuously querying data in RDF. On the volume and velocity side, the performance of RSP systems still needs to improve, particularly for joins between stored and streaming RDF graphs. Stored RDF data are very important in a streaming context (related ontologies, summarized RDF data, RDF data that do not evolve or evolve only very slowly over time, etc.), but existing RSP systems such as C-SPARQL, CQELS, SPARQLstream, EP-SPARQL and Sparkwave use non-optimized and non-scalable approaches to join stored and dynamic RDF data. Indeed, these systems need to read the entire local or remote stored RDF data set while RDF data streams continuously arrive and must be processed in near real time. The resulting latency can degrade continuous processing and often causes multiple network bottlenecks in a distributed environment. It also makes it impractical to refresh or update the stored content. This paper proposes an approach for distributed real-time joins between stored and streaming RDF graphs using Bloom filters. The join procedure speeds up processing by greatly reducing intermediate results, keeping indices in memory and precomputing query partitions according to the SPARQL query variable(s) chosen for the join between the two kinds of RDF data. Experimental evaluation results confirm the performance gains of our approach, which significantly speeds up query processing compared to current RSP techniques.
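
The general idea of a Bloom-filter-assisted join between stored and streaming triples can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the class `BloomFilter`, the functions `build_index` and `join_stream`, the sample triples, and the choice of joining the stream triple's object with the stored triple's subject are all assumptions made for illustration. The filter is built once from the join keys of the stored graph, and each incoming stream triple is probed against it, so that triples with no possible match are discarded before the in-memory index is consulted.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter (illustrative, not the paper's code)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        n = max(1, expected_items)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_index(stored_triples, key_position=0):
    """Index stored triples by join key and summarize the keys in a Bloom filter."""
    index = {}
    for triple in stored_triples:
        index.setdefault(triple[key_position], []).append(triple)
    bloom = BloomFilter(len(index))
    for key in index:
        bloom.add(key)
    return bloom, index


def join_stream(stream_triples, bloom, index, key_position=2):
    """Yield (stream_triple, stored_triple) pairs that agree on the join key."""
    for s_triple in stream_triples:
        key = s_triple[key_position]
        if key not in bloom:
            continue  # definitely no match: skip without touching the index
        # The exact index lookup removes any Bloom filter false positives.
        for stored in index.get(key, []):
            yield s_triple, stored


# Hypothetical data: the join variable is the sensor (stream object = stored subject).
STORED = [
    (":sensor1", ":locatedIn", ":roomA"),
    (":sensor2", ":locatedIn", ":roomB"),
]
STREAM = [
    (":obs1", ":observedBy", ":sensor1"),
    (":obs2", ":observedBy", ":sensor9"),
    (":obs3", ":observedBy", ":sensor2"),
]

bloom, index = build_index(STORED)
results = list(join_stream(STREAM, bloom, index))
for pair in results:
    print(pair)
```

In a distributed deployment one would presumably ship the compact filter to the nodes that receive the streams rather than the stored data itself, which is where the reduction in intermediate results and network traffic would come from; the sketch above leaves that distribution step out.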
