kb-anonymity: a model for anonymized behaviour-preserving test and debugging data

It is often very expensive and practically infeasible to generate test cases that can exercise all possible program states in a program. This is especially true for a medium or large industrial system. In practice, industrial clients of the system often have a set of input data collected either before the system is built or after the deployment of a previous version of the system. Such data are highly valuable as they represent the operations that matter in a client's daily business and may be used to extensively test the system. However, such data often carries sensitive information and cannot be released to third-party development houses. For example, a healthcare provider may have a set of patient records that are strictly confidential and cannot be used by any third party. Simply masking sensitive values alone may not be sufficient, as the correlation among fields in the data can reveal the masked information. Also, masked data may exhibit different behavior in the system and become less useful than the original data for testing and debugging. For the purpose of releasing private data for testing and debugging, this paper proposes the kb-anonymity model, which combines the k-anonymity model commonly used in the data mining and database areas with the concept of program behavior preservation. Like k-anonymity, kb-anonymity replaces some information in the original data to ensure privacy preservation so that the replaced data can be released to third-party developers. Unlike k-anonymity, kb-anonymity ensures that the replaced data exhibits the same kind of program behavior exhibited by the original data so that the replaced data may still be useful for the purposes of testing and debugging. We also provide a concrete version of the model under three particular configurations and have successfully applied our prototype implementation to three open source programs, demonstrating the utility and scalability of our prototype.

[1]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[2]  Willem Visser,et al.  Model Checking Programs with Java PathFinder , 2005, SPIN.

[3]  Adam Kiezun,et al.  jFuzz: A Concolic Whitebox Fuzzer for Java , 2009, NASA Formal Methods.

[4]  Dawson R. Engler,et al.  KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs , 2008, OSDI.

[5]  Corina S. Pasareanu,et al.  JPF-SE: A Symbolic Execution Extension to Java PathFinder , 2007, TACAS.

[6]  Yufei Tao,et al.  M-invariance: towards privacy preserving re-publication of dynamic datasets , 2007, SIGMOD '07.

[7]  Koushik Sen DART: Directed Automated Random Testing , 2009, Haifa Verification Conference.

[8]  Byung-Gon Chun,et al.  TaintDroid: An Information-Flow Tracking System for Realtime Privacy Monitoring on Smartphones , 2010, OSDI.

[9]  Andrew C. Myers,et al.  Language-based information-flow security , 2003, IEEE J. Sel. Areas Commun..

[10]  Raúl A. Santelices,et al.  Exploiting program dependencies for scalable multiple-path symbolic execution , 2010, ISSTA '10.

[11]  Rajeev Motwani,et al.  Approximation Algorithms for k-Anonymity , 2005 .

[12]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[13]  Stephen McCamant,et al.  Quantitative information flow as network flow capacity , 2008, PLDI '08.

[14]  Daniel M. Germán,et al.  An exploratory study of the evolution of software licensing , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[15]  James C. King,et al.  Symbolic execution and program testing , 1976, CACM.

[16]  Rajiv Gupta,et al.  Fault localization using value replacement , 2008, ISSTA '08.

[17]  Gregg Rothermel,et al.  Prioritizing test cases for regression testing , 2000, ISSTA '00.

[18]  Matthew B. Dwyer,et al.  Differential symbolic execution , 2008, SIGSOFT '08/FSE-16.

[19]  Benjamin Livshits,et al.  Merlin: specification inference for explicit information flow problems , 2009, PLDI '09.

[20]  Alessandro Orso,et al.  Test-Suite Augmentation for Evolving Software , 2008, 2008 23rd IEEE/ACM International Conference on Automated Software Engineering.

[21]  XiaoFeng Wang,et al.  Panalyst: Privacy-Aware Remote Error Analysis on Commodity Software , 2008, USENIX Security Symposium.

[22]  Patrice Godefroid,et al.  Automated Whitebox Fuzz Testing , 2008, NDSS.

[23]  Jane Cleland-Huang,et al.  A machine learning approach for tracing regulatory codes to product specific requirements , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[24]  Stéphane Frénot,et al.  Liability in software engineering: overview of the LISE approach and illustration on a case study , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[25]  Koushik Sen,et al.  CUTE: a concolic unit testing engine for C , 2005, ESEC/FSE-13.

[26]  Samir Khuller,et al.  Achieving anonymity via clustering , 2006, PODS '06.

[27]  Jon M. Kleinberg,et al.  Wherefore art thou R3579X? , 2011, Commun. ACM.

[28]  A. Zeller Isolating cause-effect chains from computer programs , 2002, SIGSOFT '02/FSE-10.

[29]  Frank Tip,et al.  Directed test generation for effective fault localization , 2010, ISSTA '10.

[30]  Dawn Xiaodong Song,et al.  TaintEraser: protecting sensitive data leaks using application-level taint tracking , 2011, OPSR.

[31]  Michael I. Jordan,et al.  Bug isolation via remote program sampling , 2003, PLDI.

[32]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[33]  Philippe Golle,et al.  Revisiting the uniqueness of simple demographics in the US population , 2006, WPES '06.

[34]  Pierangela Samarati,et al.  Protecting Respondents' Identities in Microdata Release , 2001, IEEE Trans. Knowl. Data Eng..

[35]  David Notkin,et al.  Symstra: A Framework for Generating Object-Oriented Unit Tests Using Symbolic Execution , 2005, TACAS.

[36]  Raúl A. Santelices,et al.  Lightweight fault-localization using multiple coverage types , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[37]  Xiangyu Zhang,et al.  Locating faults through automated predicate switching , 2006, ICSE.

[38]  Peter M. Broadwell,et al.  Scrash: A System for Generating Secure Crash Information , 2003, USENIX Security Symposium.

[39]  Sarfraz Khurshid,et al.  Generalized Symbolic Execution for Model Checking and Testing , 2003, TACAS.