Hiding distinguished ones into crowd: privacy-preserving publishing data with outliers

Publishing microdata raises concerns of individual privacy. When there exist outlier records in the microdata, the distinguishability of the outliers enables their privacy to be easier to be compromised than that of regular ones. However, none of the existing anonymization techniques can provide sufficient protection to the privacy of the outliers. In this paper, we study the problem of anonymizing the micro-data that contains outliers. We define the distinguishability-based attack by which the adversary can infer the existence of outliers as well as their private information from the anonymized microdata. To defend against the distinguishability-based attack, we define the plain k-anonymity as the privacy principle. Based on the definition, we categorize the outliers into two types, the ones that cannot be hidden by any plain k-anonymous group (called global outliers) and the ones that can (called local outliers). We propose the algorithm to efficiently anonymize local outliers with low information loss. Our experiments demonstrate the efficiency and effectiveness of our approach.

[1]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[2]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[3]  David J. DeWitt,et al.  Mondrian Multidimensional K-Anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[4]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[5]  Yufei Tao,et al.  Preservation of proximity privacy in publishing numerical sensitive data , 2008, SIGMOD Conference.

[6]  Chris Clifton,et al.  Hiding the presence of individuals from shared databases , 2007, SIGMOD '07.

[7]  Qing Zhang,et al.  Aggregate Query Answering on Anonymized Tables , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[8]  Ashwin Machanavajjhala,et al.  Worst-Case Background Knowledge for Privacy-Preserving Data Publishing , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[9]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[10]  Jian Pei,et al.  Utility-based anonymization using local recoding , 2006, KDD '06.

[11]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[12]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[13]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[14]  Rakesh Agrawal,et al.  Privacy-preserving data mining , 2000, SIGMOD 2000.

[15]  Charu C. Aggarwal,et al.  On Randomization, Public Information and the Curse of Dimensionality , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[16]  Pierangela Samarati,et al.  Protecting Respondents' Identities in Microdata Release , 2001, IEEE Trans. Knowl. Data Eng..

[17]  Raymond Chi-Wing Wong,et al.  (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing , 2006, KDD '06.

[18]  Pierangela Samarati,et al.  Generalizing Data to Provide Anonymity when Disclosing Information , 1998, PODS 1998.

[19]  A. Guttman,et al.  A Dynamic Index Structure for Spatial Searching , 1984, SIGMOD 1984.

[20]  Clara Pizzuti,et al.  Fast Outlier Detection in High Dimensional Spaces , 2002, PKDD.

[21]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[22]  Raymond Chi-Wing Wong,et al.  Minimality Attack in Privacy Preserving Data Publishing , 2007, VLDB.

[23]  Vijay S. Iyengar,et al.  Transforming data to satisfy privacy constraints , 2002, KDD.

[24]  David J. DeWitt,et al.  Incognito: efficient full-domain K-anonymity , 2005, SIGMOD '05.

[25]  Yufei Tao,et al.  Anatomy: simple and effective privacy preservation , 2006, VLDB.

[26]  Srinivasan Parthasarathy,et al.  Fast mining of distance-based outliers in high-dimensional datasets , 2008, Data Mining and Knowledge Discovery.

[27]  Y. Chen [The change of serum alpha 1-antitrypsin level in patients with spontaneous pneumothorax]. , 1995, Zhonghua jie he he hu xi za zhi = Zhonghua jiehe he huxi zazhi = Chinese journal of tuberculosis and respiratory diseases.

[28]  Stephen D. Bay,et al.  Mining distance-based outliers in near linear time with randomization and a simple pruning rule , 2003, KDD '03.

[29]  S. Reiss,et al.  Data-swapping: A technique for disclosure control , 1982 .

[30]  Adam Meyerson,et al.  On the complexity of optimal K-anonymity , 2004, PODS.

[31]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[32]  Roberto J. Bayardo,et al.  Data privacy through optimal k-anonymization , 2005, 21st International Conference on Data Engineering (ICDE'05).