Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset

We offer methods to analyze the “differentially private” Facebook URLs Dataset which, at over 40 trillion cell values, is one of the largest social science research datasets ever constructed. The version of differential privacy used in the URLs Dataset adds specially calibrated random noise, which provides mathematical guarantees for the privacy of individual research subjects while still making it possible to learn about aggregate patterns of interest to social scientists. Unfortunately, random noise creates measurement error, which induces statistical bias, including attenuation, exaggeration, switched signs, or incorrect uncertainty estimates. We adapt methods developed to correct for naturally occurring measurement error, with special attention to computational efficiency for large datasets. The result is statistically valid linear regression estimates and descriptive statistics that can be interpreted as ordinary analyses of nonconfidential data but with appropriately larger standard errors.
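To make the attenuation problem concrete, the following is a minimal simulation sketch, not the estimator developed in the paper. It assumes a single covariate, zero-mean Gaussian noise added to that covariate, and a noise standard deviation that is publicly known; the function names (`simulate`, `naive_slope`, `corrected_slope`) and all parameter values are hypothetical. The correction shown is the textbook method-of-moments adjustment for classical measurement error, which subtracts the known noise variance from the observed covariate variance.

```python
import numpy as np


def simulate(n=200_000, beta=2.0, noise_sd=1.0, seed=0):
    """Draw a confidential covariate x, an outcome y, and a noisy release of x.

    All values here are illustrative assumptions, not properties of the
    URLs Dataset.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)                      # confidential covariate, variance 1
    y = 1.0 + beta * x + rng.normal(0.0, 1.0, n)     # outcome from a linear model
    x_noisy = x + rng.normal(0.0, noise_sd, n)       # privacy noise added to x
    return x_noisy, y


def naive_slope(x, y):
    """Ordinary least squares slope, ignoring the added noise."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def corrected_slope(x_noisy, y, noise_sd):
    """Method-of-moments slope: remove the known noise variance
    from the variance of the noisy covariate before dividing."""
    denom = np.var(x_noisy, ddof=1) - noise_sd ** 2
    return np.cov(x_noisy, y)[0, 1] / denom


x_noisy, y = simulate()
b_naive = naive_slope(x_noisy, y)                    # biased toward zero
b_corr = corrected_slope(x_noisy, y, noise_sd=1.0)   # close to the true slope
print(f"naive slope: {b_naive:.3f}, corrected slope: {b_corr:.3f}")
```

In this setup the noise variance equals the covariate variance, so the naive slope is pulled toward roughly half of the true value of 2, while the corrected slope recovers it approximately. The corrected estimate is more variable than a regression on the noise-free covariate would be, which is consistent with the larger standard errors described above.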
