Scalable Iterative Classification for Sanitizing Large-Scale Datasets

Cheap, ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data is weakly structured (e.g., free text), so machine learning approaches have been developed to detect and remove identifiers from it. Learning is never perfect, so relying on such approaches to sanitize data can leak sensitive information, but a small risk is often acceptable. Our goal is to balance the value of published data against the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between (1) a publisher who chooses a set of classifiers to apply to the data and publishes only instances predicted as non-sensitive and (2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets, we show that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93 percent of the original data, and completes after at most five iterations.
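
The publisher's side of this game can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a labeled toy corpus, a deliberately weak one-token learner (`train_one_rule`, a hypothetical stand-in for a real classifier), and a simplified stopping rule: iteration ends once a freshly trained classifier catches at most `stop_at` truly sensitive instances among those still slated for publication, or after `max_rounds` rounds. All names, parameters, and data below are illustrative assumptions.

```python
from collections import Counter


def train_one_rule(docs, labels):
    """Toy learner (assumption): flag the single token most tied to sensitive docs."""
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        for tok in set(doc.split()):
            (pos if y else neg)[tok] += 1
    # Only tokens seen more often in sensitive than in non-sensitive documents.
    candidates = sorted(t for t in pos if pos[t] > neg[t])
    if not candidates:
        return lambda doc: False  # nothing left to learn: predict non-sensitive
    best = max(candidates, key=lambda t: pos[t] - neg[t])
    return lambda doc: best in doc.split()


def iterative_sanitize(docs, labels, max_rounds=5, stop_at=0):
    """Repeatedly train on what remains and drop instances predicted sensitive."""
    remaining = list(zip(docs, labels))
    rounds = 0
    while remaining and rounds < max_rounds:
        cur_docs, cur_labels = zip(*remaining)
        clf = train_one_rule(cur_docs, cur_labels)
        # Sensitive instances that an attacker using this learner could still find.
        caught = sum(1 for doc, y in remaining if y and clf(doc))
        if caught <= stop_at:  # simplified stopping rule (assumption)
            break
        remaining = [(doc, y) for doc, y in remaining if not clf(doc)]
        rounds += 1
    return [doc for doc, _ in remaining], rounds


# Hypothetical toy corpus: label 1 marks an instance containing an identifier.
docs = [
    "call john at 555 1234",
    "meeting moved to friday",
    "john signed the form",
    "patient reports mild pain",
    "the form is due friday",
    "ssn 123 45 6789 on file",
]
labels = [1, 0, 1, 0, 0, 1]

published, rounds = iterative_sanitize(docs, labels)
print(rounds, published)
```

On this toy corpus the loop removes the two instances containing "john" in the first round and the remaining identifier in the second, after which a newly trained classifier finds nothing. A realistic version would train on a labeled sample, use a stronger learner, and stop based on the attacker's estimated utility rather than on a raw count.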
