Iterative Classification for Sanitizing Large-Scale Datasets

Cheap, ubiquitous computing enables the collection of massive amounts of personal data across a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities from such data. Learning-based approaches are never perfect, however, and relying on them to sanitize data can therefore leak sensitive information. Because a small amount of risk is permissible in practice, our goal is to balance the value of the published data against the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher, who chooses a set of classifiers to apply to the data and publishes only the instances predicted to be non-sensitive, and 2) an attacker, who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations and ensures low utility for a resource-limited adversary. Moreover, on several real-world natural language corpora, we show that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data and terminating after at most five iterations.
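The publisher's iterative loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: `ToyClassifier`, its token-membership features, and the `sanitize` signature are all assumptions standing in for the real learners (e.g., statistical entity recognizers). The linear iteration bound follows because each pass that does not terminate suppresses at least one instance.

```python
class ToyClassifier:
    """Illustrative stand-in for the publisher's learners: it flags any
    record containing a token previously seen in a sensitive record."""

    def __init__(self):
        self.sensitive_tokens = set()

    def fit(self, records, labels):
        # Memorize tokens appearing in records labeled sensitive.
        for text, is_sensitive in zip(records, labels):
            if is_sensitive:
                self.sensitive_tokens.update(text.split())

    def predict(self, records):
        return [any(tok in self.sensitive_tokens for tok in text.split())
                for text in records]


def sanitize(train_records, train_labels, corpus, max_iters=None):
    """Publish only instances predicted non-sensitive, retraining until a
    full pass flags nothing new. Each non-final pass removes at least one
    instance, so the loop runs at most len(corpus) iterations."""
    train_records, train_labels = list(train_records), list(train_labels)
    published = list(corpus)
    for _ in range(max_iters or len(corpus)):
        clf = ToyClassifier()
        clf.fit(train_records, train_labels)
        flagged = clf.predict(published)
        if not any(flagged):
            break  # nothing left for a same-strength learner to find
        # Feed newly suppressed instances back as sensitive training
        # examples so the next pass can generalize from them.
        train_records += [r for r, f in zip(published, flagged) if f]
        train_labels += [True] * sum(flagged)
        published = [r for r, f in zip(published, flagged) if not f]
    return published
```

For example, training on `["alice smith"]` labeled sensitive and sanitizing `["alice met bob", "bob went home", "the weather is nice"]` takes two suppression passes: the first pass removes the record mentioning "alice", the second removes the "bob" record learned from the first suppression, and only the weather record is published.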
