El Emam et al.: De-identification of the Heritage Health Prize Claims Data Set

Background There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
Objective To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
Methods We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high, then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
Results An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was 0.0084, below the 0.05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
Conclusions It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
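
The risk metric described in the Methods (the probability that a record can be re-identified given an attempted attack, compared against a 0.05 threshold) can be illustrated with a minimal sketch. It assumes a common simplification from the k-anonymity literature: a record's risk is taken as 1 divided by the number of records sharing its quasi-identifier values. This is not necessarily the estimator used for the HHP data; the function names, field names, and toy records below are hypothetical.

```python
from collections import Counter

def record_risks(records, quasi_identifiers):
    """Per-record risk as 1 / (size of the record's equivalence class).

    Assumption: an adversary matches on exactly the given quasi-identifiers,
    so records with identical values are indistinguishable from each other.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

def summarize_risk(records, quasi_identifiers, threshold=0.05):
    """Maximum risk, average risk, and share of records above the threshold."""
    risks = record_risks(records, quasi_identifiers)
    n = len(risks)
    return {
        "max_risk": max(risks),
        "avg_risk": sum(risks) / n,
        "share_above_threshold": sum(r > threshold for r in risks) / n,
    }

# Hypothetical toy data: three records share one combination, one is unique.
toy_records = [
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "80+", "sex": "M"},
]
summary = summarize_risk(toy_records, ["age_group", "sex"])
```

Under this simplification, the unique record has a risk of 1 and would need generalization or suppression before release, which is the role the de-identification step plays whenever the evaluated risk exceeds the threshold.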
