Individual privacy versus public good: protecting confidentiality in health research

Health and medical data are increasingly generated, collected, and stored in electronic form in healthcare facilities and administrative agencies. Such data hold a wealth of information vital to effective health policy development and evaluation, as well as to enhanced clinical care through evidence-based practice and safety and quality monitoring. These initiatives are aimed at improving individuals' health and well-being. Nevertheless, analyses of health data archives must be conducted in such a way that individuals' privacy is not compromised. One important aspect of protecting individuals' privacy is protecting the confidentiality of their data. This paper reviews a number of approaches to reducing disclosure risk when making data available for research, and presents a taxonomy of such approaches. Some of these methods are widely used, whereas others are still in development. A range of methods is needed because data-use scenarios also vary, and a method should be chosen to suit the scenario at hand. In practice, it is necessary to find a balance between allowing the use of health and medical data for research and protecting confidentiality. This balance is often presented as a trade-off between disclosure risk and data utility, because methods that reduce disclosure risk generally also reduce data utility.
