A One-Pass Private Sketch for Most Machine Learning Tasks

Differential privacy (DP) is a compelling privacy definition that formalizes the privacy-utility tradeoff through provable guarantees. Inspired by recent progress toward general-purpose data release algorithms, we propose a private sketch, a small summary of the dataset, that supports a multitude of machine learning tasks including regression, classification, density estimation, near-neighbor search, and more. Our sketch consists of randomized contingency tables that are indexed with locality-sensitive hashing (LSH) and constructed with an efficient one-pass algorithm. We prove competitive error bounds for DP kernel density estimation. Existing methods for DP kernel density estimation scale poorly, with runtimes that often grow exponentially in the dimension. In contrast, our sketch can quickly process large, high-dimensional datasets in a single pass. Extensive experiments show that our generic sketch delivers a privacy-utility tradeoff similar to existing DP methods at a fraction of the computational cost. We expect that our sketch will enable differential privacy in distributed, large-scale machine learning settings.
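To make the construction concrete, below is a minimal Python sketch of the idea: repeated arrays of LSH-indexed counters (in the spirit of ACE/RACE-style count estimators) privatized with the Laplace mechanism. The class name, the parameters (rows, width, bits), and the choice of signed random projections as the LSH family are illustrative assumptions for this sketch, not the paper's exact construction. Since adding or removing one data point changes exactly one counter in each of the R rows, the L1 sensitivity of the counter array is R, so per-counter Laplace noise with scale R/epsilon suffices for epsilon-DP.

```python
import numpy as np

class PrivateRACESketch:
    """Illustrative one-pass private sketch (assumed construction, not the
    paper's exact one): R rows of W counters, each row indexed by an
    independent signed-random-projection LSH function. Adding/removing one
    point changes one counter per row, so L1 sensitivity is R and the
    Laplace mechanism with per-counter scale R/epsilon gives epsilon-DP."""

    def __init__(self, dim, rows=50, width=64, bits=6, epsilon=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.bits = bits                 # SRP bits per row; 2**bits <= width
        self.width = width
        # One set of `bits` random hyperplanes for each of the `rows` rows.
        self.planes = rng.standard_normal((rows, bits, dim))
        self.counts = np.zeros((rows, width))
        self.n = 0
        self.scale = rows / epsilon      # Laplace scale = sensitivity / epsilon

    def _hash(self, x):
        # Signed random projections -> one integer bucket per row.
        signs = (np.einsum('rbd,d->rb', self.planes, x) > 0).astype(int)
        return (signs @ (1 << np.arange(self.bits))) % self.width

    def insert(self, x):
        # One-pass update: increment one counter in every row.
        self.counts[np.arange(len(self.counts)), self._hash(x)] += 1
        self.n += 1

    def privatize(self, rng=None):
        # Laplace mechanism applied once, after all insertions.
        rng = rng if rng is not None else np.random.default_rng()
        self.counts = self.counts + rng.laplace(0.0, self.scale, self.counts.shape)

    def query(self, q):
        # Mean counter value at the query's buckets, normalized by N,
        # estimates the LSH collision-probability kernel density at q.
        hits = self.counts[np.arange(len(self.counts)), self._hash(q)]
        return hits.mean() / max(self.n, 1)


# Example usage: build the sketch in one pass, privatize, then query a
# (noisy) kernel density estimate under the SRP collision kernel.
X = np.random.default_rng(2).standard_normal((10_000, 128))
sketch = PrivateRACESketch(dim=128, epsilon=1.0)
for x in X:
    sketch.insert(x)
sketch.privatize()
print(sketch.query(X[0]))
```

Note the design point this sketch illustrates: the data are touched exactly once (streaming inserts), the released object is only the noisy counter array, and any downstream task (KDE, classification via class-conditional densities, near-neighbor scoring) is answered from the sketch alone.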
