Data Provenance: What next?

Research into data provenance has been active for almost twenty years. What has it delivered, and where will it go next? What practical impact has it had, and what might it have? We offer speculative answers to these questions, which may be somewhat biased by our initial motivation for studying the topic: the need for provenance information in curated databases. Such databases involve extensive human interaction with data, and we argue that the same need persists in other forms of human interaction, such as those that take place in social media.
