Provenance Inference Techniques: Taxonomy, comparative analysis and design challenges

Abstract: Provenance has many applications in assessing data quality, computational efficiency, security, and storage reliability. Provenance inference (PI) is the process of drawing conclusions from evidence or reasoning by means of static code analysis. However, despite its manifold applications, PI has not been thoroughly studied. The main objective of this article is to provide a comprehensive review of the available literature on provenance inference techniques (PITs). To achieve this, we first identify the requirements essential for PITs. We then propose a thematic classification of the existing PITs in the form of a taxonomy. We also perform a comprehensive comparative analysis that highlights the strengths and weaknesses of the existing literature on PITs. Furthermore, we identify a set of design challenges that should be taken into consideration. Finally, we conclude the paper by presenting recommendations, issues, and open research challenges that should be considered by future research studies in this domain.
