On the identity anonymization of high‐dimensional rating data

We study the challenges of protecting the privacy of individuals in large public survey rating data. Such data usually contains ratings of both sensitive and non‐sensitive issues, and the ratings of sensitive issues involve personal privacy. Although the survey participants do not reveal any of their ratings, their survey records are potentially identifiable by using information from other public sources. None of the existing anonymization principles (e.g. k‐anonymity, l‐diversity) can effectively prevent such breaches in large survey rating data sets. In this paper, we tackle the problem by defining a principle called (k, ε, l)‐anonymity. The principle requires that, for each transaction t in the given survey rating data T, at least (k − 1) other transactions in T have ratings similar to t, where the similarity is controlled by ε, and that the standard deviation of the sensitive ratings is at least l. We propose a greedy approach to anonymizing the survey rating data that scales almost linearly with the input size, and we apply the method to two real‐life data sets to demonstrate its efficiency and practical utility. Copyright © 2011 John Wiley & Sons, Ltd.
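The (k, ε, l) requirement can be illustrated with a small check in Python. This is only a sketch under stated assumptions, not the paper's algorithm: the abstract does not give the similarity measure, so the code assumes two transactions are similar when every non‐sensitive rating differs by at most ε, assumes one sensitive rating per transaction, and measures the standard deviation over a transaction together with its similar neighbours. The function name `is_k_eps_l_anonymous` and the record layout are hypothetical.

```python
from statistics import pstdev


def is_k_eps_l_anonymous(records, k, eps, l):
    """Check a (k, eps, l)-style condition on survey rating data.

    records: list of (non_sensitive_ratings, sensitive_rating) pairs,
             where non_sensitive_ratings is a tuple of numbers.
    ASSUMPTION: similarity is the maximum absolute rating difference
    over the non-sensitive issues; the paper may define it differently.
    """
    for i, (ratings_i, sensitive_i) in enumerate(records):
        # Other transactions whose non-sensitive ratings are within eps.
        neighbours = [
            j
            for j, (ratings_j, _) in enumerate(records)
            if j != i
            and max(
                (abs(a - b) for a, b in zip(ratings_i, ratings_j)),
                default=0,
            ) <= eps
        ]
        # At least (k - 1) other similar transactions are required.
        if len(neighbours) < k - 1:
            return False
        # ASSUMPTION: the spread of sensitive ratings is measured over
        # the transaction and its neighbours (population std. deviation).
        group = [sensitive_i] + [records[j][1] for j in neighbours]
        if pstdev(group) < l:
            return False
    return True


# Toy data: three mutually similar transactions and one outlier.
records = [((5, 3), 1), ((5, 4), 5), ((4, 3), 3), ((1, 1), 2)]
print(is_k_eps_l_anonymous(records[:3], k=2, eps=1, l=1.0))
print(is_k_eps_l_anonymous(records, k=2, eps=1, l=1.0))
```

In the toy data the first three transactions satisfy the condition on their own, while adding the fourth violates it because no other transaction lies within ε of it. The pairwise scan is quadratic in the number of transactions and is meant only to make the definition concrete; it says nothing about how the greedy anonymization itself proceeds.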
