Efficient Algorithms for Masking and Finding Quasi-Identifiers

A quasi-identifier is a subset of attributes that can uniquely identify most tuples in a table. Careless publication of quasi-identifiers leads to privacy leakage. In this paper we consider the problems of finding and masking quasi-identifiers. Both problems are provably hard, with severe time and space requirements, so we focus on designing efficient approximation algorithms for large data sets. We first propose two natural measures for quantifying quasi-identifiers: distinct ratio and separation ratio. We develop efficient algorithms that find small quasi-identifiers with provable guarantees on size and on separation or distinct ratio, using space and time sublinear in the number of tuples. We also design practical algorithms for finding all minimal quasi-identifiers. Finally, we propose efficient algorithms for masking quasi-identifiers, in which random sampling greatly reduces the space and time requirements without much sacrifice in the quality of the results. Our algorithms for masking and finding minimum quasi-identifiers apply naturally to stream databases. Extensive experimental results on real-world data sets confirm the efficiency and accuracy of our algorithms.
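The abstract names the two measures without defining them. As a minimal sketch, assume that the distinct ratio of an attribute subset is the fraction of tuples that remain distinct after projecting onto that subset, and that the separation ratio is the fraction of tuple pairs that differ on at least one attribute of the subset. The function names and the toy table below are hypothetical and serve only as illustration; this is an exact computation over the full table, not the sublinear sampling algorithms the paper develops.

```python
from collections import Counter


def project(rows, attrs):
    """Project each row (a dict) onto the given attribute names."""
    return [tuple(row[a] for a in attrs) for row in rows]


def distinct_ratio(rows, attrs):
    """Assumed definition: distinct projected tuples / total tuples."""
    projected = project(rows, attrs)
    return len(set(projected)) / len(projected)


def separation_ratio(rows, attrs):
    """Assumed definition: fraction of tuple pairs that differ on attrs."""
    projected = project(rows, attrs)
    n = len(projected)
    total_pairs = n * (n - 1) // 2
    # A pair is NOT separated exactly when both tuples share a projection,
    # so count the pairs inside each group of identical projections.
    unseparated = sum(c * (c - 1) // 2 for c in Counter(projected).values())
    return (total_pairs - unseparated) / total_pairs


# Hypothetical toy table, not taken from the paper's data sets.
rows = [
    {"zip": "94305", "age": 30, "sex": "F"},
    {"zip": "94305", "age": 30, "sex": "M"},
    {"zip": "94305", "age": 41, "sex": "F"},
    {"zip": "94040", "age": 41, "sex": "F"},
]

for attrs in (["zip"], ["zip", "age"], ["zip", "age", "sex"]):
    print(attrs, distinct_ratio(rows, attrs), separation_ratio(rows, attrs))
```

On this toy table, adding attributes raises both measures until the full subset separates every pair of tuples, which is the sense in which a subset behaves as a quasi-identifier under these assumed definitions.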
