Differentially Private Chi-Squared Hypothesis Testing: Goodness of Fit and Independence Testing

Hypothesis testing is a useful statistical tool for determining whether a given model should be rejected based on a sample from the population. Sample data may contain sensitive information about individuals, such as medical information, so it is important to design statistical tests that guarantee the privacy of the subjects in the data. In this work, we study hypothesis testing subject to differential privacy, specifically chi-squared tests for goodness of fit for multinomial data and for independence between two categorical variables. We propose new tests for goodness of fit and independence that, like their classical versions, can be used to determine whether a given model should be rejected, and that additionally ensure differential privacy. We give both Monte Carlo-based hypothesis tests and hypothesis tests that more closely follow the classical chi-squared goodness of fit test and the Pearson chi-squared test for independence. Crucially, when determining significance, our tests account for the distribution of the noise that is injected to ensure privacy. We show that these tests can be used to achieve desired significance levels, in sharp contrast to direct applications of classical tests to differentially private contingency tables, which can result in wildly varying significance levels. Moreover, we study the statistical power of these tests. We show empirically that, to achieve the same level of power as the classical non-private tests, our new tests need only a relatively modest increase in sample size.
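To make the Monte Carlo idea concrete, the following is a minimal sketch of a private goodness of fit test, not the exact procedure of the paper. It assumes that Laplace noise with scale 2/epsilon is added to each cell count, that the sample size n is public, and that the null distribution of the noisy chi-squared statistic is estimated by simulating both the multinomial sampling and the privacy noise. The function names `chi2_stat` and `private_gof_mc` and all parameter defaults are hypothetical.

```python
import numpy as np


def chi2_stat(counts, expected):
    """Pearson chi-squared statistic for (possibly noisy) cell counts."""
    return float(np.sum((counts - expected) ** 2 / expected))


def private_gof_mc(counts, p0, epsilon, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo goodness of fit test on Laplace-noised counts (sketch).

    Assumptions (not taken from the source): Laplace noise with scale
    2 / epsilon per cell, and the sample size n treated as public.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    n = int(counts.sum())
    expected = n * p0
    scale = 2.0 / epsilon

    # Privatize the histogram once; only the noisy counts are used below.
    noisy = counts + rng.laplace(0.0, scale, size=counts.shape)
    observed = chi2_stat(noisy, expected)

    # Simulate the statistic under H0, including the privacy noise, so the
    # reference distribution matches what the analyst actually observes.
    sims = rng.multinomial(n, p0, size=n_sim).astype(float)
    sims += rng.laplace(0.0, scale, size=sims.shape)
    null_stats = np.sum((sims - expected) ** 2 / expected, axis=1)

    # Monte Carlo p-value with the usual +1 correction.
    p_value = float((1 + np.sum(null_stats >= observed)) / (n_sim + 1))
    return bool(p_value <= alpha), p_value


if __name__ == "__main__":
    uniform = [1 / 3, 1 / 3, 1 / 3]
    print(private_gof_mc([900, 50, 50], uniform, epsilon=1.0))
    print(private_gof_mc([334, 333, 333], uniform, epsilon=1.0))
```

Because the simulated statistics include the same noise as the observed one, the rejection threshold reflects both sampling variability and privacy noise, which is the property the abstract highlights as crucial.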
