A Privacy Reinforcement Approach against De-identified Dataset

Protecting individual privacy is a key issue in data dissemination. Powerful search utilities now increase the re-identification risk by making it easier than before to collect and validate information. Even when a dataset has been de-identified before release, attackers may still recognize an individual by matching the released records against information they already hold. In this paper, we propose an approach that mitigates the identity disclosure problem by generating plurals in a given dataset. The approach uses a decision tree to help select quasi-identifiers, and several masking techniques can then be employed for privacy reinforcement. Besides being applicable to different privacy metrics, the approach can achieve a better trade-off between data integrity and privacy protection through flexible data masking.
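As a rough illustration of the idea, the sketch below finds records whose quasi-identifier values are unique in a dataset and then masks one attribute so that every record shares its values with at least one other. It is not the paper's method. It assumes that "plurals" means records that are no longer unique on their quasi-identifiers, it assumes the quasi-identifiers (`age`, `zip`) have already been chosen rather than selecting them with a decision tree, and it uses generalization of age into ranges as one example of a masking technique. The toy records and all function names are hypothetical.

```python
from collections import Counter


def class_sizes(records, quasi_ids):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_ids) for r in records)


def at_risk(records, quasi_ids, k=2):
    """Return records whose quasi-identifier combination occurs fewer than k times."""
    counts = class_sizes(records, quasi_ids)
    return [r for r in records if counts[tuple(r[q] for q in quasi_ids)] < k]


def generalize_age(record, width=10):
    """Mask the exact age by replacing it with a range of the given width."""
    low = (record["age"] // width) * width
    masked = dict(record)  # copy, so the original record is left unchanged
    masked["age"] = f"{low}-{low + width - 1}"
    return masked


# Hypothetical toy dataset; "diagnosis" plays the role of the sensitive attribute.
records = [
    {"age": 34, "zip": "10001", "diagnosis": "flu"},
    {"age": 36, "zip": "10001", "diagnosis": "asthma"},
    {"age": 52, "zip": "10002", "diagnosis": "flu"},
    {"age": 57, "zip": "10002", "diagnosis": "diabetes"},
]
quasi_ids = ["age", "zip"]  # assumed to be already selected

at_risk_before = at_risk(records, quasi_ids)
masked = [generalize_age(r) for r in records]
at_risk_after = at_risk(masked, quasi_ids)

print(len(at_risk_before), "unique records before masking")
print(len(at_risk_after), "unique records after masking")
```

Widening the age range trades data integrity for privacy: a larger `width` makes more records share values but leaves less precise data, which is the kind of trade-off the abstract refers to.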
