Measuring the impact of spatial perturbations on the relationship between data privacy and validity of descriptive statistics

Background: Like many scientific fields, epidemiology is addressing issues of research reproducibility. Spatial epidemiology, which often uses the inherently identifiable variable of participant address, must balance reproducibility with participant privacy. In this study, we assess the impact of several different data perturbation methods on key spatial statistics and patient privacy.

Methods: We analyzed the impact of perturbation on spatial patterns in the full set of address-level mortality data from Lawrence, MA during the period from 1911 to 1913. The original death locations were perturbed using seven different published approaches to stochastic and deterministic spatial data anonymization. Key spatial descriptive statistics were calculated for each perturbation, including changes in spatial pattern center, Global Moran’s I, Local Moran’s I, distance to the k-th nearest neighbors, and the L-function (a normalized form of Ripley’s K). A spatially adapted form of k-anonymity was used to measure the privacy protection conferred by each method and its compliance with HIPAA and GDPR privacy standards.

Results: Random perturbation at 50 m, donut masking between 5 and 50 m, and Voronoi masking maintained the validity of descriptive spatial statistics better than other perturbations. Grid center masking with both 100 × 100 m and 250 × 250 m cells led to large changes in descriptive spatial statistics. None of the perturbation methods adhered to the HIPAA standard that all points have a k-anonymity > 10. All other perturbation methods employed had at least 265 points, or over 6%, not adhering to the HIPAA standard.

Conclusions: Using the set of published perturbation methods applied in this analysis, HIPAA- and GDPR-compliant de-identification was not compatible with maintaining key spatial patterns as measured by our chosen summary statistics. Further research should investigate alternative methods for balancing the tradeoff between spatial data privacy and the preservation of key patterns in public health data that are of scientific and medical importance.
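To make the Methods more concrete, the following is a minimal sketch of donut masking between 5 and 50 m together with one common way of estimating spatial k-anonymity. It is not the authors' implementation. The function names, the assumption of projected coordinates in metres, the area-uniform sampling of the displacement radius, and the definition of k as the number of candidate locations at least as close to the masked point as the true location are all illustrative assumptions.

```python
import math
import random


def donut_mask(points, r_min=5.0, r_max=50.0, seed=0):
    """Displace each (x, y) point by a random distance in [r_min, r_max].

    Assumes projected coordinates in metres (hypothetical setup).
    """
    rng = random.Random(seed)
    masked = []
    for x, y in points:
        theta = rng.uniform(0.0, 2.0 * math.pi)
        # Assumption: sample the radius so that displaced points are
        # uniform over the area of the annulus, not uniform in radius.
        r = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
        masked.append((x + r * math.cos(theta), y + r * math.sin(theta)))
    return masked


def spatial_k_anonymity(original, masked, candidates):
    """Estimate k for each point (illustrative definition).

    k is the number of candidate locations (e.g. residential addresses)
    lying within the circle centred on the masked point whose radius is
    the distance back to the true location.
    """
    ks = []
    for (ox, oy), (mx, my) in zip(original, masked):
        d = math.hypot(ox - mx, oy - my)
        k = sum(
            1 for cx, cy in candidates
            if math.hypot(cx - mx, cy - my) <= d
        )
        ks.append(k)
    return ks


if __name__ == "__main__":
    # Toy data: three true locations that also serve as the candidate set.
    true_points = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
    masked_points = donut_mask(true_points, seed=1)
    ks = spatial_k_anonymity(true_points, masked_points, true_points)
    # Count points that fall short of the k-anonymity > 10 threshold
    # mentioned in the Results.
    print(sum(1 for k in ks if k <= 10))
```

Under this sketch, a point would count as not adhering to the threshold cited in the Results whenever its estimated k is 10 or lower. In practice the candidate set would be a full address or population layer rather than the masked points themselves.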
