Tracing nested data with structural provenance for big data analytics

Big data analytics systems such as Apache Spark natively support nested data formats, offering operators to manipulate nested lists and complex types. Compared to flat data, nested data introduces further complexity and sources of error, e.g., when developing data processing pipelines, performing auditing tasks, or tuning performance. To ease such tasks, we propose a provenance-based solution tailored to nested data processing in big data analytics systems. Unlike previous solutions, it combines (i) tracing the provenance of nested data with (ii) efficient and scalable provenance processing, leveraging a newly proposed structural provenance that traces structural manipulations through data processing pipelines in addition to the data itself. We provide a formal definition of structural provenance, as well as methods to efficiently capture and succinctly backtrace it. We implement them in our Pebble system in Apache Spark and validate its performance and usefulness on up to 500 GB of real-world data.
