Pipeline provenance for cloud‐based big data analytics

Provenance is information about the origin and creation of data. In cloud-based data science and engineering, such information is useful and sometimes critical. In data analytics, making data‐driven decisions requires tracing history back and reproducing final or intermediate results, and even tuning models and adjusting parameters in real time. In the cloud in particular, users need to evaluate the trustworthiness of data and pipelines. In this paper, we propose LogProv, a solution toward realizing these functionalities for big data provenance. LogProv requires renovating data pipelines, or parts of the big data software infrastructure, to generate structured logs for pipeline events; it then stores data and logs separately in cloud storage. The data are explicitly linked to the logs, which implicitly record pipeline semantics. Semantic information can be retrieved from the logs easily because they are well defined and structured beforehand. We implemented and deployed LogProv in Nectar Cloud, together with Apache Pig and the Hadoop ecosystem, and adopted Elasticsearch to provide the query service. LogProv was evaluated empirically through case studies. The results show that LogProv is efficient: the performance overhead is no more than 10%, queries are answered within 1 second, trustworthiness is marked clearly, and there is no impact on the data processing logic of the original pipelines.
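The idea of keeping data and structured event logs in separate stores, with explicit links between them, can be illustrated with a minimal sketch. This is not LogProv's implementation or log schema: the record fields, the `data_id`, `log_event`, and `trace_back` helpers, and the in-memory stores are all assumptions made for illustration, and a plain Python list stands in for the Elasticsearch-backed query service.

```python
import hashlib
import json
import time


def data_id(payload: bytes) -> str:
    """Hypothetical identifier that links a data item to its log records."""
    return hashlib.sha256(payload).hexdigest()[:16]


def log_event(log_store, operation, input_ids, output_id, params=None):
    """Append one structured record describing a pipeline event (assumed schema)."""
    record = {
        "timestamp": time.time(),
        "operation": operation,
        "inputs": list(input_ids),
        "output": output_id,
        "params": params or {},
    }
    log_store.append(json.dumps(record, sort_keys=True))
    return record


def trace_back(log_store, target_id):
    """Follow output -> inputs links to recover the events behind target_id."""
    records = [json.loads(line) for line in log_store]
    by_output = {r["output"]: r for r in records}
    lineage, frontier, seen = [], [target_id], set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        rec = by_output.get(current)
        if rec is None:  # a source item with no producing event
            continue
        lineage.append(rec)
        frontier.extend(rec["inputs"])
    return lineage


# Data and logs live in separate stores (both in memory here).
data_store = {}
log_store = []

raw = b"1,2,3,4"
raw_id = data_id(raw)
data_store[raw_id] = raw

filtered = b"2,4"
filtered_id = data_id(filtered)
data_store[filtered_id] = filtered
log_event(log_store, "filter_even", [raw_id], filtered_id)

total = b"6"
total_id = data_id(total)
data_store[total_id] = total
log_event(log_store, "sum", [filtered_id], total_id)

steps = [r["operation"] for r in trace_back(log_store, total_id)]
print(steps)  # ['sum', 'filter_even']
```

Because each record names its inputs and output by identifier, the history of any result can be rebuilt from the logs alone, without touching the pipeline's own processing logic.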
