Personal Health Information Leak Prevention in Heterogeneous Texts

We built a system that prevents leaks of personal health information inadvertently disclosed in heterogeneous text data. The system works with free-form texts. We tested it empirically on files gathered from peer-to-peer file exchange networks. This study presents our text analysis apparatus. We discuss how lexical sources used in the scientific medical domain can be adapted for the analysis of personal health information.
