A Differentially Private Kernel Two-Sample Test

Kernel two-sample testing is a useful statistical tool in determining whether data samples arise from different distributions without imposing any parametric assumptions on those distributions. However, raw data samples can expose sensitive information about individuals who participate in scientific studies, which makes the current tests vulnerable to privacy breaches. Hence, we design a new framework for kernel two-sample testing conforming to differential privacy constraints, in order to guarantee the privacy of subjects in the data. Unlike existing differentially private parametric tests that simply add noise to data, kernel-based testing imposes a challenge due to a complex dependence of test statistics on the raw data, as these statistics correspond to estimators of distances between representations of probability measures in Hilbert spaces. Our approach considers finite dimensional approximations to those representations. As a result, a simple chi-squared test is obtained, where a test statistic depends on a mean and covariance of empirical differences between the samples, which we perturb for a privacy guarantee. We investigate the utility of our framework in two realistic settings and conclude that our method requires only a relatively modest increase in sample size to achieve a similar level of power to the non-private tests in both settings.

[1]  Bernhard Schölkopf,et al.  A Kernel Two-Sample Test , 2012, J. Mach. Learn. Res..

[2]  Vitaly Shmatikov,et al.  Privacy-preserving data exploration in genome-wide association studies , 2013, KDD.

[3]  John P. Cunningham,et al.  Bayesian Learning of Kernel Embeddings , 2016, UAI.

[4]  Bernhard Schölkopf,et al.  A Kernel Method for the Two-Sample-Problem , 2006, NIPS.

[5]  Grace Wahba,et al.  Spline Models for Observational Data , 1990 .

[6]  Zaïd Harchaoui,et al.  A Fast, Consistent Kernel Two-Sample Test , 2009, NIPS.

[7]  Amit Sahai,et al.  Do Distributed Differentially-Private Protocols Require Oblivious Transfer? , 2016, ICALP.

[8]  Marco Gaboardi,et al.  Local Private Hypothesis Testing: Chi-Square Tests , 2017, ICML.

[9]  Hans-Peter Kriegel,et al.  Integrating structured biological data by Kernel Maximum Mean Discrepancy , 2006, ISMB.

[10]  Prateek Jain,et al.  Differentially Private Learning with Kernels , 2013, ICML.

[11]  Larry A. Wasserman,et al.  Differential privacy for functions and functional data , 2012, J. Mach. Learn. Res..

[12]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[13]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[14]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[15]  Janine E. Deakin,et al.  The Evolution of Epigenetic Regulators CTCF and BORIS/CTCFL in Amniotes , 2008, PLoS genetics.

[16]  Moni Naor,et al.  Our Data, Ourselves: Privacy Via Distributed Noise Generation , 2006, EUROCRYPT.

[17]  Anand D. Sarwate,et al.  Differentially Private Empirical Risk Minimization , 2009, J. Mach. Learn. Res..

[18]  Yu-Xiang Wang,et al.  Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising , 2018, ICML.

[19]  Sivaraman Balakrishnan,et al.  Optimal kernel choice for large-scale two-sample tests , 2012, NIPS.

[20]  Daniel Kifer,et al.  A New Class of Private Chi-Square Hypothesis Tests , 2017, AISTATS.

[21]  Arthur Gretton,et al.  Interpretable Distribution Features with Maximum Testing Power , 2016, NIPS.

[22]  Bernhard Schölkopf,et al.  Kernel Mean Embedding of Distributions: A Review and Beyonds , 2016, Found. Trends Mach. Learn..

[23]  Bernhard Schölkopf,et al.  Differentially Private Database Release via Kernel Mean Embeddings , 2017, ICML.

[24]  Li Zhang,et al.  Analyze gauss: optimal bounds for privacy-preserving principal component analysis , 2014, STOC.

[25]  Arthur Gretton,et al.  Fast Two-Sample Testing with Analytic Representations of Probability Measures , 2015, NIPS.

[26]  Toniann Pitassi,et al.  The Limits of Two-Party Differential Privacy , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[27]  Dino Sejdinovic,et al.  Bayesian Approaches to Distribution Regression , 2017, AISTATS.

[28]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[29]  Ryan M. Rogers,et al.  Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing , 2016, ICML 2016.

[30]  Luc Van Gool,et al.  Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks , 2016, International Journal of Computer Vision.