Privacy Vulnerabilities of Dataset Anonymization Techniques

Vast amounts of information of all types are collected daily about people by governments, corporations, and individuals. The information is collected when users register for or use online applications, receive health-related services, use their mobile phones, query search engines, or perform common daily activities. As a result, there is an enormous quantity of privately held records that describe individuals' finances, interests, activities, and demographics. These records often include sensitive data, and publishing them may violate users' privacy. The common approach to safeguarding user information, or data in general, is to limit access to the storage (usually a database) through an authentication and authorization protocol, so that only users with legitimate permissions can access the data. In many cases, however, publishing user data can be highly beneficial for both academic and commercial purposes, such as statistical research and recommendation systems. To maintain user privacy when such publication occurs, many databases employ anonymization techniques, applied either to the query results or to the data itself. In this paper we examine variants of two such techniques, "data perturbation" and "query-set-size control", and discuss their vulnerabilities. Data perturbation changes the values of records in the dataset while maintaining a level of accuracy in the resulting queries. We focus on a relatively new data perturbation method called NeNDS and show a possible partial-knowledge attack on its privacy. Query-set-size control allows publication of a query result only if a minimum number, k, of records satisfy the query parameters. We show that some query types relying on this method can still be used to extract hidden information, and prove that others maintain privacy even across multiple queries.
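
As a rough illustration of the two mechanisms discussed above, the Python sketch below shows a generic additive-noise perturbation and a query-set-size check. The toy records, the threshold k = 3, and the Gaussian noise are illustrative assumptions only; in particular, the perturbation shown is a generic baseline, not the NeNDS method examined in the paper.

```python
import random

# Hypothetical toy dataset: each record is (age, salary).
# Values are illustrative only and do not come from the paper.
records = [(34, 52000), (41, 61000), (29, 48000), (56, 90000), (38, 57000)]

def query_set_size_control(records, predicate, aggregate, k=3):
    """Release an aggregate only if at least k records satisfy the predicate.

    Queries whose result set is smaller than the threshold k are refused,
    mirroring the query-set-size control idea described above.
    """
    matching = [r for r in records if predicate(r)]
    if len(matching) < k:
        return None  # query refused: result set is too small
    return aggregate(matching)

def perturb(records, scale=1000.0):
    """Naive additive Gaussian-noise perturbation of salaries.

    A generic illustration of data perturbation, not the NeNDS method.
    """
    return [(age, salary + random.gauss(0, scale)) for age, salary in records]

# Average salary of people older than 30 is released (4 matches >= k),
# while the same query restricted to people older than 50 is refused (1 match).
avg = lambda rs: sum(s for _, s in rs) / len(rs)
print(query_set_size_control(records, lambda r: r[0] > 30, avg))
print(query_set_size_control(records, lambda r: r[0] > 50, avg))
```

The refused second query illustrates the intent of the control: small result sets could otherwise reveal information about specific individuals, which is exactly the kind of leakage the paper probes through combinations of permitted queries.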
