On provenance and privacy

Provenance in scientific workflows is a double-edged sword. On the one hand, recording the module executions used to produce a data item, together with the parameter settings and the intermediate data items passed between module executions, makes results transparent and reproducible. On the other hand, a scientific workflow often contains private or confidential data and uses proprietary modules. Hence, providing exact answers to provenance queries over all executions of the workflow may reveal private information. In this paper we discuss three privacy concerns in scientific workflows (data privacy, module privacy, and structural privacy) and frame several natural questions: (i) Can we formally analyze data, module, and structural privacy, giving provable privacy guarantees for an unlimited or a bounded number of provenance queries? (ii) How can we answer search and structural queries over repositories of workflow specifications and their executions, providing as much information as possible to the user while still guaranteeing privacy? We then highlight some recent work in this area and point to several directions for future work.
