INSPECTRE: Privately Estimating the Unseen

We develop differentially private methods for estimating various distributional properties. Given a sample from a discrete distribution $p$, some functional $f$, and accuracy and privacy parameters $\alpha$ and $\varepsilon$, the goal is to estimate $f(p)$ up to accuracy $\alpha$, while maintaining $\varepsilon$-differential privacy of the sample. We prove almost-tight bounds on the sample size required for this problem for several functionals of interest, including support size, support coverage, and entropy. We show that the cost of privacy is negligible in a variety of settings, both theoretically and experimentally. Our methods are based on a sensitivity analysis of several state-of-the-art methods for estimating these properties with sublinear sample complexities.
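To make the sensitivity-based approach concrete, here is a minimal sketch of the standard Laplace mechanism applied to a simple statistic: the number of distinct symbols in the sample. It is an illustration under stated assumptions, not the estimators analyzed in the paper. Replacing one sample point changes the distinct count by at most 1, so adding Laplace noise of scale $1/\varepsilon$ gives $\varepsilon$-differential privacy. The function names `distinct_count` and `private_distinct_count` are hypothetical. The noise sampler uses floating-point arithmetic, which a production implementation would need to treat more carefully.

```python
import random


def distinct_count(sample):
    """Number of distinct symbols observed in the sample."""
    return len(set(sample))


def private_distinct_count(sample, epsilon, rng=None):
    """Release the distinct count with Laplace noise (illustrative sketch).

    Assumption: neighboring samples differ in exactly one element, so the
    statistic has sensitivity 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return distinct_count(sample) + noise


if __name__ == "__main__":
    data = ["a", "b", "b", "c", "a", "d"]
    print(private_distinct_count(data, epsilon=1.0, rng=random.Random(0)))
```

The same pattern applies to any statistic with a known sensitivity bound: a smaller bound means less noise at the same $\varepsilon$, which is why the sensitivity analysis determines the cost of privacy.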
