A Parallel Method for Scalable Anonymization of Transaction Data

Transaction data, such as market basket or diagnostic data, contain sensitive information about individuals. Such data are often disseminated widely to support analytic studies. This raises privacy concerns, as the confidentiality of individuals must be protected. Economization is an established methodology to protect transaction data, which can be applied using different algorithms. RBAT is an algorithm for anonymzitng transaction data that has many desirable features. These include flexible specification of privacy requirements and the ability to preserve data utility well. However, like most economization methods, RBAT is a sequential algorithm that is not scalable to large datasets. This limits the applicability of RBAT in practice. To address this issue, in this paper, we develop a parallel version of RBAT using MapReduce. We partition the data across cluster of computing nodes and implement the key operations of RBAT in parallel. Our experimental results show that scalable economization of large transaction datasets can be achieved using MapReduce and our method can scale nearly linear to the number of processing nodes.

[1]  Aris Gkoulalas-Divanis,et al.  Efficient and flexible anonymization of transaction data , 2012, Knowledge and Information Systems.

[2]  Jinjun Chen,et al.  Combining Top-Down and Bottom-Up: Scalable Sub-tree Anonymization over Big Data Using MapReduce on Cloud , 2013, 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

[3]  Feifei Li,et al.  Efficient parallel kNN joins for large data in MapReduce , 2012, EDBT '12.

[4]  Panos Kalnis,et al.  Local and global recoding methods for anonymizing set-valued data , 2010, The VLDB Journal.

[5]  Aditya G. Parameswaran,et al.  Fuzzy Joins Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[6]  Bradley Malin,et al.  COAT: COnstraint-based anonymization of transactions , 2010, Knowledge and Information Systems.

[7]  XiaoFeng Wang,et al.  Sedic: privacy-aware data intensive computing on hybrid clouds , 2011, CCS '11.

[8]  Panos Kalnis,et al.  Privacy-preserving anonymization of set-valued data , 2008, Proc. VLDB Endow..

[9]  Rajkumar Buyya,et al.  MELODY-JOIN: Efficient Earth Mover's Distance similarity joins using MapReduce , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[10]  Philip S. Yu,et al.  Privacy-preserving data publishing: A survey of recent developments , 2010, CSUR.

[11]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[12]  Yon Dohn Chung,et al.  Parallel data processing with MapReduce: a survey , 2012, SGMD.

[13]  Jimeng Sun,et al.  Publishing data from electronic health records while preserving privacy: A survey of algorithms , 2014, J. Biomed. Informatics.

[14]  David Taniar,et al.  High Performance Parallel Database Processing and Grid Databases , 2008 .

[15]  Vitaly Shmatikov,et al.  Airavat: Security and Privacy for MapReduce , 2010, NSDI.

[16]  David J. DeWitt,et al.  Workload-aware anonymization techniques for large-scale datasets , 2008, TODS.

[17]  Christos Faloutsos,et al.  Clustering very large multi-dimensional datasets with MapReduce , 2011, KDD.

[18]  Jimeng Sun,et al.  DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[19]  Benjamin Moseley,et al.  Fast clustering using MapReduce , 2011, KDD.

[20]  Chedy Raïssi,et al.  ρ-uncertainty , 2010, Proc. VLDB Endow..

[21]  David Taniar,et al.  High-Performance Parallel Database Processing and Grid Databases: Taniar/High-Performance Parallel DP & Grid DB , 2008 .

[22]  Qing He,et al.  Parallel K-Means Clustering Based on MapReduce , 2009, CloudCom.

[23]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[24]  A Special Report on Managing Information , 2022 .

[25]  Aris Gkoulalas-Divanis,et al.  Anonymizing Transaction Data to Eliminate Sensitive Inferences , 2010, DEXA.

[26]  Philip S. Yu,et al.  Privacy-Preserving Data Mining - Models and Algorithms , 2008, Advances in Database Systems.

[27]  Jinjun Chen,et al.  A Privacy Leakage Upper Bound Constraint-Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud , 2013, IEEE Transactions on Parallel and Distributed Systems.

[28]  D. DeWitt,et al.  K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[29]  Nikos Mamoulis,et al.  Privacy Preservation by Disassociation , 2012, Proc. VLDB Endow..

[30]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[31]  Stefan Gottschalk,et al.  Privacy Preserving Data Mining Models And Algorithms , 2016 .

[32]  Jinjun Chen,et al.  A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud , 2014, IEEE Transactions on Parallel and Distributed Systems.

[33]  Beng Chin Ooi,et al.  Efficient Processing of k Nearest Neighbor Joins using MapReduce , 2012, Proc. VLDB Endow..

[34]  Philip S. Yu,et al.  Anonymizing transaction databases for publication , 2008, KDD.

[35]  Eli Upfal,et al.  PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce , 2012, CIKM.

[36]  Jeffrey F. Naughton,et al.  Anonymization of Set-Valued Data via Top-Down, Local Generalization , 2009, Proc. VLDB Endow..