De-Health: All Your Online Health Information Are Belong to Us

In this paper, we study the privacy of online health data. We present a novel online health data De-Anonymization (DA) framework, named De-Health. Using two real-world online health datasets, WebMD and HealthBoards, we validate the DA efficacy of De-Health. We also present a linkage attack framework that can link online health/medical information to real-world people. Through a proof-of-concept attack, we link 347 out of 2805 WebMD users (about 12.4%) to real-world people, and find the full names, medical/health information, birthdates, phone numbers, and other sensitive information for most of the re-identified users. This clearly illustrates the fragility of the privacy of those who use online health forums.
