Differentially Private Survey Research

Survey researchers have long protected the privacy of respondents via de-identification (removing names and other directly identifying information) before sharing data. Although these procedures help, recent research demonstrates that they fail to protect respondents from intentional re-identification attacks, a problem that threatens to undermine vast survey enterprises in academia, government, and industry. The problem is especially acute in political science because political beliefs are not merely the subject of our scholarship; they represent some of the most important information respondents want to keep private. We confirm the problem in practice by re-identifying individuals from a survey about a controversial referendum declaring that life begins at conception. We build on the concept of "differential privacy" to offer new data-sharing procedures with mathematical guarantees for protecting respondent privacy and statistical validity guarantees for social scientists analyzing differentially private data. The cost of these new procedures is larger standard errors, which can be offset with somewhat larger sample sizes.
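To make the mechanism concrete, the sketch below illustrates the Laplace mechanism of Dwork et al. (2006), the canonical building block of differential privacy: noise drawn from a Laplace distribution, scaled to the query's sensitivity and the privacy budget epsilon, is added to a statistic before release. This is a minimal illustrative sketch, not the data-sharing procedure developed in the paper; the function name, sample size, and parameter values are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale sensitivity/epsilon.

    Smaller epsilon means stronger privacy and a noisier release.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: privately release the proportion answering "yes"
# in a survey of n = 1000 respondents. Changing one respondent's binary
# answer moves the proportion by at most 1/n, so the query's sensitivity
# is 1/n.
n = 1000
true_proportion = 0.62  # illustrative survey result, not real data
private_proportion = laplace_mechanism(true_proportion,
                                       sensitivity=1 / n,
                                       epsilon=1.0)
print(private_proportion)

# The added noise has variance 2 * (sensitivity/epsilon)**2, which inflates
# the standard error of downstream estimates -- the cost the abstract
# describes, recoverable by collecting a somewhat larger sample.
```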
