Lightning: Utility-Driven Anonymization of High-Dimensional Data

The ARX Data Anonymization Tool is a software tool for privacy-preserving microdata publishing. It implements methods of statistical disclosure control and supports a wide variety of privacy models, which are used to specify disclosure risk thresholds. Data is mainly transformed with a combination of two methods: (1) global recoding with full-domain generalization of attribute values, followed by (2) local recoding with record suppression. Within this transformation model, it is feasible to compute an optimal solution with minimal loss of data quality when the dataset has low dimensionality. For high-dimensional data, however, combinatorial complexity makes this approach impractical. In this article, we describe the Lightning algorithm, a simple yet effective utility-driven heuristic search strategy that we have implemented in ARX for anonymizing high-dimensional datasets. Our work improves upon existing methods because it is not tailored to specific models for measuring disclosure risk and data utility. We have performed an extensive experimental evaluation in which we compared our approach to state-of-the-art heuristic algorithms and to a globally optimal search algorithm. In this evaluation, we used several real-world datasets, different models for measuring data utility and a wide variety of privacy models. The results show that our method outperforms previous approaches in terms of output quality, even when using k-anonymity, which is the model for which previous work was designed.
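The two-step transformation model named in the abstract can be illustrated with a minimal Python sketch. It is not the Lightning algorithm and does not use the ARX API: the hierarchies (`generalize_age`, `generalize_zip`), the helper `anonymize` and the sample records are all hypothetical, chosen only to show full-domain generalization followed by record suppression under k-anonymity.

```python
from collections import Counter

# Hypothetical generalization hierarchies; level 0 keeps the original value.
def generalize_age(age, level):
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"  # ten-year interval
    return "*"  # fully generalized

def generalize_zip(zipcode, level):
    # Mask the last `level` characters of the ZIP code.
    level = min(level, len(zipcode))
    return zipcode[: len(zipcode) - level] + "*" * level

def anonymize(records, age_level, zip_level, k):
    """Full-domain generalization, then suppression of records in classes smaller than k."""
    # Step 1: every value of an attribute is generalized to the same level.
    generalized = [
        (generalize_age(age, age_level), generalize_zip(zipcode, zip_level))
        for age, zipcode in records
    ]
    # Step 2: drop records whose equivalence class violates k-anonymity.
    class_sizes = Counter(generalized)
    kept = [r for r in generalized if class_sizes[r] >= k]
    return kept, len(generalized) - len(kept)

# Illustrative records only (age, ZIP code).
records = [
    (34, "81667"), (36, "81675"), (38, "81669"),
    (52, "80331"), (57, "80339"), (23, "10115"),
]
kept, suppressed = anonymize(records, age_level=1, zip_level=2, k=2)
print(kept, suppressed)
```

Each pair of generalization levels is one candidate transformation. With several attributes and deeper hierarchies the number of such combinations grows multiplicatively, which is why an exhaustive search for the optimum becomes infeasible for high-dimensional data and a heuristic search is used instead.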
