A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud

Many cloud services require users to share private data, such as electronic health records, for data analysis or mining, which raises privacy concerns. Anonymizing data sets via generalization to satisfy privacy requirements such as k-anonymity is a widely used category of privacy-preserving techniques. The scale of data in many cloud applications is growing rapidly in line with the Big Data trend, making it difficult for commonly used software tools to capture, manage, and process such data within a tolerable elapsed time. As a result, existing anonymization approaches struggle to preserve privacy on large-scale, privacy-sensitive data sets because they do not scale well. In this paper, we propose a scalable two-phase top-down specialization (TDS) approach to anonymize large-scale data sets using the MapReduce framework on cloud. In both phases of our approach, we design a group of MapReduce jobs that carry out the specialization computation in a highly scalable way. Experimental evaluation results demonstrate that our approach significantly improves the scalability and efficiency of TDS over existing approaches.
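
To make the notions of generalization and k-anonymity concrete, the following is a minimal sketch, not the MapReduce jobs proposed in the paper. It assumes hypothetical records with two quasi-identifiers (age and ZIP code), generalizes them to ranges and prefixes, and counts quasi-identifier groups with a map step and a reduce step run in a single process. All names (RECORDS, generalize, map_phase, reduce_phase, is_k_anonymous) and all data values are illustrative assumptions.

```python
from collections import Counter

# Hypothetical records: (age, zip_code, diagnosis). Age and ZIP code are the
# quasi-identifiers; diagnosis is the sensitive attribute.
RECORDS = [
    (34, "47906", "flu"),
    (36, "47902", "asthma"),
    (38, "47905", "flu"),
    (52, "47301", "diabetes"),
    (57, "47302", "flu"),
]


def generalize(record, age_width, zip_digits):
    """Replace the age with a range and mask trailing ZIP digits."""
    age, zip_code, diagnosis = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    zip_prefix = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, zip_prefix, diagnosis)


def map_phase(records):
    """Emit (quasi-identifier group, 1) for every generalized record."""
    for age_range, zip_prefix, _ in records:
        yield (age_range, zip_prefix), 1


def reduce_phase(pairs):
    """Sum the emitted counts per quasi-identifier group."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def is_k_anonymous(records, k):
    """True if every quasi-identifier group contains at least k records."""
    counts = reduce_phase(map_phase(records))
    return all(count >= k for count in counts.values())


# A coarse generalization (20-year age ranges) and a finer one (5-year ranges).
coarse = [generalize(r, age_width=20, zip_digits=3) for r in RECORDS]
fine = [generalize(r, age_width=5, zip_digits=3) for r in RECORDS]

print(is_k_anonymous(coarse, 2))
print(is_k_anonymous(fine, 2))
```

In this toy data the coarse generalization yields groups of three and two records, so it satisfies 2-anonymity, while the finer one produces singleton groups and does not. A top-down specialization procedure moves in this direction, from general values toward more specific ones, and would stop specializing before the k-anonymity requirement is violated.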
