GeneaLog: Fine-Grained Data Streaming Provenance at the Edge

Fine-grained data provenance in data streaming links each result tuple back to the source data that contributed to it, which benefits many applications (e.g., finding the conditions that triggered a security- or safety-related alert). Further, when data transmission or storage has to be minimized, as in edge computing and cyber-physical systems, it can help identify the source data to be prioritized. The memory and processing costs of fine-grained data provenance, while possibly affordable on high-end servers, can be prohibitive for the resource-constrained devices deployed in edge computing and cyber-physical systems. Motivated by this challenge, we present GeneaLog, a novel fine-grained data provenance technique for data streaming applications. Leveraging the logical dependencies of the data, GeneaLog takes advantage of cross-layer properties of the software stack and incurs a minimal, constant-size per-tuple overhead. Furthermore, it allows for a modular and efficient algorithmic implementation using only standard data streaming operators. This is particularly useful for distributed streaming applications, since the provenance processing can be executed at separate nodes, orthogonal to the data processing. We evaluate an implementation of GeneaLog using vehicular and smart grid applications, confirming that it efficiently captures fine-grained provenance data with minimal overhead.
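To make the idea of a constant-size per-tuple overhead concrete, the following is a minimal Python sketch, not the GeneaLog implementation. It assumes each tuple carries a fixed set of metadata fields (a kind tag and three references, here named `kind`, `u1`, `u2` and `nxt`) and that each tuple feeds at most one aggregate. All class, field and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class StreamTuple:
    """A tuple with a fixed number of provenance fields (illustrative only)."""
    value: object
    kind: str = "SOURCE"                   # SOURCE, MAP, JOIN or AGGREGATE
    u1: Optional["StreamTuple"] = None     # upstream reference 1
    u2: Optional["StreamTuple"] = None     # upstream reference 2
    nxt: Optional["StreamTuple"] = None    # next tuple in the same window


def map_tuple(inp: StreamTuple, value: object) -> StreamTuple:
    # A map output depends on exactly one input tuple.
    return StreamTuple(value, "MAP", u1=inp)


def join_tuples(left: StreamTuple, right: StreamTuple, value: object) -> StreamTuple:
    # A join output depends on one tuple from each input stream.
    return StreamTuple(value, "JOIN", u1=left, u2=right)


def aggregate_window(window: List[StreamTuple], value: object) -> StreamTuple:
    # Chain the window's tuples in arrival order, then keep references only
    # to the earliest (u2) and latest (u1) tuple, so the output's metadata
    # size does not grow with the window size.
    for earlier, later in zip(window, window[1:]):
        earlier.nxt = later
    return StreamTuple(value, "AGGREGATE", u1=window[-1], u2=window[0])


def find_sources(result: StreamTuple) -> List[StreamTuple]:
    """Walk the references back from a result tuple to its source tuples."""
    sources, seen, stack = [], set(), [result]
    while stack:
        t = stack.pop()
        if id(t) in seen:
            continue
        seen.add(id(t))
        if t.kind == "SOURCE":
            sources.append(t)
        elif t.kind == "MAP":
            stack.append(t.u1)
        elif t.kind == "JOIN":
            stack.extend([t.u1, t.u2])
        elif t.kind == "AGGREGATE":
            cur = t.u2
            while cur is not None:         # follow the chain from earliest ...
                stack.append(cur)
                if cur is t.u1:            # ... and stop at the latest tuple
                    break
                cur = cur.nxt
    return sources


# Hypothetical query: double each reading, sum a window, join with a threshold.
readings = [StreamTuple(v) for v in (10, 20, 30)]
doubled = [map_tuple(r, r.value * 2) for r in readings]
total = aggregate_window(doubled, sum(d.value for d in doubled))
threshold = StreamTuple(100)
alert = join_tuples(total, threshold, total.value > threshold.value)
alert_sources = sorted(s.value for s in find_sources(alert))
```

In this sketch every tuple stores the same four metadata fields regardless of how many source tuples contributed to it. The traversal in `find_sources` is a separate step from the operators that produce the tuples, which loosely mirrors the separation of provenance processing from data processing described above.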
