Efficient and flexible anonymization of transaction data

Transaction data are increasingly used in applications such as marketing research and biomedical studies. Publishing these data, however, risks privacy breaches, because they often contain personal information about individuals. Approaches to anonymizing transaction data have been proposed recently, but they may produce data that are both excessively distorted and inadequately protected. This is because these approaches do not model the privacy requirements common in real-world applications in a realistic and flexible manner, and they safeguard the data against either identity disclosure or sensitive information inference, but not both. In this paper, we propose a new approach that overcomes these limitations. We introduce a rule-based privacy model that allows data publishers to express fine-grained protection requirements for both identity and sensitive information disclosure. Based on this model, we develop two anonymization algorithms. Our first algorithm works in a top-down fashion, employing an efficient strategy to recursively generalize data with low information loss. Our second algorithm uses sampling and a combination of top-down and bottom-up generalization heuristics, which greatly improves scalability while maintaining low information loss. Extensive experiments show that our algorithms significantly outperform state-of-the-art methods in terms of retaining data utility, while achieving good protection and scalability.
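To make the idea of a rule-based requirement concrete, the following is a minimal sketch and not the paper's definition. It assumes a rule of the form "public items → sensitive items" counts as protected when at least k transactions contain the public items (limiting identity disclosure) and at most a fraction c of those also contain the sensitive items (limiting sensitive inference). The thresholds `k` and `c`, the item names, and the functions `support`, `rule_is_protected`, and `generalize` are all hypothetical and introduced only for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def rule_is_protected(public: Iterable[str], sensitive: Iterable[str],
                      transactions: List[Transaction], k: int, c: float) -> bool:
    """Hypothetical check for one rule `public -> sensitive`."""
    public, sensitive = frozenset(public), frozenset(sensitive)
    sup_public = support(public, transactions)
    if sup_public == 0:
        # Assumption: an itemset absent from the data cannot be linked to anyone.
        return True
    if sup_public < k:
        # Too few matching transactions: identity disclosure risk.
        return False
    # Fraction of matching transactions that also hold the sensitive items.
    confidence = support(public | sensitive, transactions) / sup_public
    return confidence <= c


def generalize(transactions: List[Transaction],
               mapping: Dict[str, str]) -> List[Transaction]:
    """Replace each item by its generalized item, if the mapping has one."""
    return [frozenset(mapping.get(item, item) for item in t) for t in transactions]


# Toy data: "HIV" and "flu" play the role of sensitive items.
data = [
    frozenset({"beer", "wine", "HIV"}),
    frozenset({"beer", "flu"}),
    frozenset({"wine", "flu"}),
    frozenset({"beer", "wine", "flu"}),
]

before = rule_is_protected({"beer", "wine"}, {"HIV"}, data, k=3, c=0.5)
generalized = generalize(data, {"beer": "alcohol", "wine": "alcohol"})
after = rule_is_protected({"alcohol"}, {"HIV"}, generalized, k=3, c=0.5)
```

In this toy data only two transactions contain both `beer` and `wine`, so the rule fails for k = 3. After both items are generalized to the hypothetical item `alcohol`, all four transactions match and only one of them contains `HIV`, so the same requirement holds, at the cost of no longer distinguishing the two original items.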
