An on-the-fly provenance tracking mechanism for stream processing systems

Applications that operate over high-volume streaming data under real-time processing requirements are becoming increasingly important. These applications process streaming data in real time and deliver immediate responses to support precise and timely decisions. In such systems, real-time traceability (the ability to verify and investigate the source of a particular output) is extremely important. This ability allows raw streaming data to be checked and processing steps to be verified and validated in a timely manner. It is therefore crucial that stream systems have a mechanism for dynamically tracking provenance (the process that produced result data) at execution time, which we refer to as on-the-fly stream provenance tracking. In this paper, we propose a novel on-the-fly provenance tracking mechanism that enables provenance queries to be performed dynamically without requiring provenance assertions to be stored persistently. We demonstrate how our provenance mechanism works by means of an on-the-fly provenance tracking algorithm. The experimental evaluation shows that our provenance solution does not significantly affect the normal processing of stream systems, adding a 7% overhead. Moreover, our provenance solution offers low-latency processing (0.3 ms per additional component) with reasonable memory consumption.
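The general idea of answering provenance queries at execution time, without a persistent provenance store, can be illustrated with a small sketch in which every stream event carries the identifiers of the raw inputs it was derived from. This is only an illustration under stated assumptions, not the algorithm proposed in the paper: the names `Event`, `source`, `map_op`, and `window_avg`, the tumbling-window operator, and the `stream:index` identifier scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    """A stream element annotated with its provenance (hypothetical structure)."""
    value: float
    # IDs of the raw source events this value was derived from.
    sources: FrozenSet[str]


def source(stream_id: str, values: Iterable[float]) -> Iterator[Event]:
    """Tag each raw value with its own ID, e.g. 's:0', 's:1', ..."""
    for i, v in enumerate(values):
        yield Event(v, frozenset({f"{stream_id}:{i}"}))


def map_op(fn: Callable[[float], float], events: Iterable[Event]) -> Iterator[Event]:
    """One-to-one operator: the output inherits the input's provenance."""
    for e in events:
        yield Event(fn(e.value), e.sources)


def window_avg(size: int, events: Iterable[Event]) -> Iterator[Event]:
    """Tumbling-window average: the output's provenance is the union of
    the provenance of every event in the window."""
    buf: List[Event] = []
    for e in events:
        buf.append(e)
        if len(buf) == size:
            total = sum(x.value for x in buf)
            srcs = frozenset().union(*(x.sources for x in buf))
            yield Event(total / size, srcs)
            buf = []


# Provenance is available as soon as each result is produced;
# nothing is written to a separate provenance store.
pipeline = window_avg(2, map_op(lambda v: v * 2, source("s", [1.0, 2.0, 3.0, 4.0])))
for result in pipeline:
    print(result.value, sorted(result.sources))
```

In this sketch the provenance annotation travels with the data, so a query such as "which raw events produced this output?" is answered by reading `result.sources` at the moment the result is emitted. The trade-off is that annotations grow with the number of contributing inputs, which is one reason a real system would need to bound or compress them.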
