t-Plausibility: Generalizing Words to Desensitize Text

De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fall short on two counts. First, complete redaction may not be necessary to prevent re-identification, and it reduces the readability and usability of the text. Second, and more serious, identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after obvious identifiers are removed. Observe that a diagnosis of "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be replaced by more general but semantically related terms to protect sensitive and identifying information without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is a novel information-theoretic approach to text sanitization, together with efficient heuristics for sanitizing text documents.
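
The generalization idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the `HYPERNYM` table, the `generalize` and `sanitize` functions, and the `levels` parameter are all hypothetical names introduced here, and the tiny hand-written hierarchy stands in for a real lexical resource. The sketch only shows how a sensitive term might be swapped for a more general, semantically related one instead of being redacted.

```python
import re

# Hypothetical toy hierarchy (term -> more general term); an assumption for
# illustration only, not a resource or data set taken from the paper.
HYPERNYM = {
    "tuberculosis": "infectious disease",
    "influenza": "infectious disease",
    "infectious disease": "disease",
}


def generalize(term, levels=1):
    """Climb up to `levels` steps in the toy hierarchy, stopping at the top."""
    for _ in range(levels):
        if term not in HYPERNYM:
            break
        term = HYPERNYM[term]
    return term


def sanitize(text, sensitive_terms, levels=1):
    """Replace each sensitive term in `text` with a more general term."""
    if not sensitive_terms:
        return text
    # Longest terms first so multi-word terms win over their substrings.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    # A single pass avoids re-generalizing text that was just substituted.
    return re.sub(
        pattern,
        lambda m: generalize(m.group(0).lower(), levels),
        text,
        flags=re.IGNORECASE,
    )


if __name__ == "__main__":
    note = "The patient was treated for tuberculosis."
    print(sanitize(note, {"tuberculosis"}))
```

Raising `levels` trades information for protection: one step maps "tuberculosis" to "infectious disease", while two steps map it to the even less informative "disease".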
