Prochlo: Strong Privacy for Analytics in the Crowd

The large-scale monitoring of computer users' software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. This paper describes a principled systems architecture, called Encode, Shuffle, Analyze (ESA), for performing such monitoring with high utility while also protecting user privacy. The ESA design, and its Prochlo implementation, are informed by our practical experiences with an existing, large deployment of privacy-preserving software monitoring. With ESA, the privacy of monitored users' data is guaranteed by its processing in a three-step pipeline. First, the data is encoded to control scope, granularity, and randomness. Second, the encoded data is collected in batches subject to a randomized threshold, and blindly shuffled, to break linkability and to ensure that individual data items get "lost in the crowd" of the batch. Third, the anonymous, shuffled data is analyzed by a specific analysis engine that further prevents statistical inference attacks on analysis results. ESA extends existing best-practice methods for sensitive-data analytics, by using cryptography and statistical techniques to make explicit how data is elided and reduced in precision, how only common-enough, anonymous data is analyzed, and how this is done for only specific, permitted purposes. As a result, ESA remains compatible with the established workflows of traditional database analysis. Strong privacy guarantees, including differential privacy, can be established at each processing step to defend against malice or compromise at one or more of those steps. Prochlo develops new techniques to harden those steps, including the Stash Shuffle, a novel scalable and efficient oblivious-shuffling algorithm based on Intel's SGX, and new applications of cryptographic secret sharing and blinding. We describe ESA and Prochlo, as well as experiments that validate their ability to balance utility and privacy.
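The three-step pipeline described above can be illustrated with a toy sketch. This is not the Prochlo implementation: it uses no cryptography, SGX, or secret sharing, and every name and parameter in it (`encode`, `shuffle_batch`, `analyze`, `flip_prob`, `threshold`, `jitter`, the one-letter coarsening) is a hypothetical choice made only to show how encoding, thresholded shuffling, and analysis fit together.

```python
import random
from collections import Counter

ALPHABET = ("a", "b", "c")  # hypothetical domain of coarsened values


def encode(value, rng, flip_prob):
    """Encode step (assumed form): reduce granularity, then add randomness."""
    coarse = value.lower()[:1]          # keep only the first letter
    if rng.random() < flip_prob:        # randomized response with prob. flip_prob
        return rng.choice(ALPHABET)
    return coarse


def shuffle_batch(reports, rng, threshold=3, jitter=1):
    """Shuffle step (assumed form): drop values rarer than a randomized
    threshold, then permute the batch so arrival order carries no information."""
    counts = Counter(reports)
    kept = []
    for value, n in counts.items():
        noisy_threshold = threshold + rng.randint(0, jitter)
        if n >= noisy_threshold:
            kept.extend([value] * n)
    rng.shuffle(kept)
    return kept


def analyze(batch):
    """Analyze step (assumed form): a plain histogram over the anonymous batch."""
    return dict(Counter(batch))


def run_pipeline(raw_values, seed=0, flip_prob=0.1):
    rng = random.Random(seed)
    encoded = [encode(v, rng, flip_prob) for v in raw_values]
    return analyze(shuffle_batch(encoded, rng))


if __name__ == "__main__":
    raw = ["Apple"] * 10 + ["Banana"] * 8 + ["Cherry"] * 1
    print(run_pipeline(raw, seed=0, flip_prob=0.0))
```

With `flip_prob=0.0` the encoder adds no noise, so the single rare "Cherry" report falls below the threshold and is dropped before analysis, which is the "lost in the crowd" property in miniature. A real deployment would additionally need the encryption, oblivious shuffling, and differentially private analysis that the abstract attributes to Prochlo.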
