Differential Privacy for Coverage Analysis of Software Traces (Artifact)

This work considers software execution traces, where a trace is a sequence of run-time events. Each user of a software system collects the set of traces covered by her execution of the software and reports this set to an analysis server. Our goal is to report each user's local data in a privacy-preserving manner by employing local differential privacy, a powerful theoretical framework for designing privacy-preserving data analyses. A significant advantage of such analyses is that they offer principled "built-in" privacy with clearly defined and quantifiable privacy protections. In local differential privacy, the data of an individual user is modified using a local randomizer before being sent to the untrusted analysis server. Based on the randomized information from all users, the analysis server computes, for each trace, an estimate of how many users have covered it.

Such analysis requires that the domain of possible traces be defined ahead of time. Unlike in prior related work, here the domain is either infinite or, at best, restricted to many billions of elements. Further, the traces in this domain typically have structure defined by the static properties of the software. To capture these novel aspects, we define the trace domain with the help of context-free grammars. We illustrate this approach with two exemplars: a call chain analysis in which traces are described by a regular language, and an enter/exit trace analysis in which traces are described by a balanced-parentheses context-free language.

Randomization over such domains is challenging due to their large size, which makes it impossible to use prior randomization techniques. To solve this problem, we propose to use a count sketch, a fixed-size hashing data structure for summarizing frequent items. We develop a version of the count sketch for trace analysis and demonstrate its suitability for software execution data.
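The count sketch mentioned above can be illustrated as follows. This is a generic, textbook-style sketch of the data structure with illustrative defaults (d = 5 rows, w = 256 counters per row), not the paper's trace-specific variant: each item hashes to one counter per row with a pseudo-random sign, and the median of the signed counter reads estimates the item's frequency.

```python
import hashlib

class CountSketch:
    """Minimal count sketch: d rows of w signed counters.
    Each item maps to one counter per row and is added with a
    pseudo-random sign; the median over rows of the signed
    counter reads estimates the item's frequency."""

    def __init__(self, d=5, w=256):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _hash(self, row, item, tag):
        # Deterministic per-row hash via blake2b on a tagged key.
        data = f"{tag}:{row}:{item}".encode()
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _bucket(self, row, item):
        idx = self._hash(row, item, "idx") % self.w
        sign = 1 if self._hash(row, item, "sign") % 2 == 0 else -1
        return idx, sign

    def add(self, item, count=1):
        for r in range(self.d):
            idx, sign = self._bucket(r, item)
            self.table[r][idx] += sign * count

    def estimate(self, item):
        reads = sorted(
            sign * self.table[r][idx]
            for r in range(self.d)
            for idx, sign in [self._bucket(r, item)]
        )
        return reads[self.d // 2]  # median (d is odd)
```

For example, after adding the (hypothetical) trace `"main->A->B"` one hundred times and `"main->C"` three times, `estimate` returns values close to the true counts, since the random signs cancel most collision noise across rows.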
In addition, instead of separately randomizing each contribution to the sketch, we develop a much faster one-shot randomization of the accumulated sketch data.

One important client of the collected information is the identification of high-frequency ("hot") traces. We develop a novel approach to identify hot traces from the collected randomized sketches. A key insight is that the very large domain of possible traces can be efficiently explored for hot traces by using the frequency estimates of a visited trace and its prefixes and suffixes. Our experimental study of both call chain analysis and enter/exit trace analysis indicates that the frequency estimates, as well as the identification of hot traces, achieve high accuracy and high privacy.

2012 ACM Subject Classification: Software and its engineering → Dynamic analysis; Security and privacy → Privacy-preserving protocols
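The idea of one-shot randomization of the accumulated sketch can be sketched as follows. This is an illustrative assumption for exposition, not the paper's exact randomizer: after all of a user's traces have been accumulated into the sketch, independent Laplace noise is added to each counter once, rather than randomizing every individual update. The function name and parameters are hypothetical.

```python
import math
import random

def one_shot_randomize(table, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace(0, sensitivity/epsilon) noise to each accumulated
    sketch counter in a single pass over the finished sketch, instead
    of perturbing every individual contribution as it arrives."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon

    def laplace():
        # The difference of two i.i.d. exponentials is Laplace(0, scale).
        return scale * (math.log(1.0 - rng.random()) - math.log(1.0 - rng.random()))

    return [[c + laplace() for c in row] for row in table]
```

Because the noise is drawn once per counter, the cost is proportional to the (fixed) sketch size rather than to the number of traces a user contributes, which is the source of the speedup the abstract describes.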
