Improving the Utility of Poisson-Distributed, Differentially Private Synthetic Data via Prior Predictive Truncation with an Application to CDC WONDER

CDC WONDER is a web-based tool for the dissemination of epidemiologic data collected by the National Vital Statistics System. While CDC WONDER has built-in privacy protections, these do not satisfy formal privacy definitions such as differential privacy and are thus susceptible to targeted attacks. Given the importance of making high-quality public health data publicly available while preserving the privacy of the underlying data subjects, we aim to improve the utility of a recently developed approach for generating Poisson-distributed, differentially private synthetic data by using publicly available information to truncate the range of the synthetic data. Specifically, we utilize county-level population information from the U.S. Census Bureau and national death reports produced by the CDC to inform prior distributions on county-level death rates and infer reasonable ranges for Poisson-distributed, county-level death counts. In doing so, the requirements for satisfying differential privacy for a given privacy budget can be reduced by several orders of magnitude, thereby leading to substantial improvements in utility. To illustrate our proposed approach, we consider a dataset comprising over 26,000 cancer-related deaths from the Commonwealth of Pennsylvania, spread across over 47,000 combinations of cause of death and demographic variables such as age, race, sex, and county of residence. We demonstrate the proposed framework’s ability to preserve features present in the true data, such as geographic, urban/rural, and racial disparities.
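The idea of inferring a reasonable range for a count from public information can be illustrated with a minimal sketch. It assumes a gamma prior on the death rate, which is not stated in the abstract, so that the prior predictive distribution of a Poisson count is negative binomial. The function names, the prior parameters, and the tail probability are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def neg_binom_pmf(y, shape, prob):
    """Negative binomial pmf: the prior predictive of a Poisson count
    whose rate has a gamma prior (an assumption made for this sketch)."""
    log_pmf = (
        math.lgamma(y + shape) - math.lgamma(shape) - math.lgamma(y + 1)
        + shape * math.log(prob) + y * math.log1p(-prob)
    )
    return math.exp(log_pmf)


def prior_predictive_upper_bound(population, prior_shape, prior_rate,
                                 tail_prob=1e-6, max_count=10_000_000):
    """Smallest count y whose prior predictive CDF is at least 1 - tail_prob.

    Model assumed here: rate ~ Gamma(prior_shape, prior_rate) and
    count | rate ~ Poisson(population * rate).
    """
    prob = prior_rate / (prior_rate + population)
    cdf = 0.0
    for y in range(max_count + 1):
        cdf += neg_binom_pmf(y, prior_shape, prob)
        if cdf >= 1.0 - tail_prob:
            return y
    raise ValueError("upper bound exceeds max_count")


# Hypothetical stratum: 50,000 residents, prior mean rate of 200 deaths
# per 100,000 (shape 2, rate 1000), so the prior mean count is 100.
bound = prior_predictive_upper_bound(50_000, prior_shape=2.0, prior_rate=1_000.0)
print(bound)
```

Restricting synthetic counts to the interval from 0 to such a bound, rather than from 0 to the full population, is one way a truncated range could be obtained from public inputs. How the truncated range enters the privacy guarantee is not described in the abstract and is not modeled here.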
