Strider: A Hybrid Adaptive Distributed RDF Stream Processing Engine

Real-time processing of data streams emanating from sensors is becoming a common task in Internet of Things scenarios. The key implementation goal is to handle massive incoming data streams efficiently and to support advanced data analytics services such as anomaly detection. In an ongoing industrial project, a stream processing engine that is available 24/7 usually faces dynamically changing data and workload characteristics. These changes impact the engine’s performance and reliability. We propose Strider, a hybrid adaptive distributed RDF Stream Processing engine that optimizes the logical query plan according to the state of the data streams. Strider has been designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput and acceptable latency. These guarantees are obtained by designing the engine’s architecture with state-of-the-art Apache components such as Spark and Kafka. We highlight the efficiency of Strider on real-world and synthetic data sets: on a single machine, up to a 60x gain in throughput compared to state-of-the-art systems, and a throughput of 3.1 million triples/second on a 9-machine cluster, a major breakthrough in this system’s category.
