Experimenting sensitivity-based anonymization framework in Apache Spark

Privacy is one of the biggest concerns in big data and analytics. We expect forthcoming frameworks and theories to yield several solutions for privacy protection. One well-known solution is k-anonymity, which was introduced for traditional data. Two major frameworks have recently driven big data processing and applications: MapReduce and Spark. Spark has been attracting more attention because of its impact on a wide range of big data applications, among which data analytics and anonymization are predominant. We previously proposed an anonymization method that implements k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark's performance in data anonymization. Spark is a fast processing framework that has been applied in several domains, such as SQL, multimedia, and data streams. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in memory, we need to observe its limitations, speed, and fault tolerance as the data size increases, and to compare MapReduce with Spark in anonymization processing. Spark introduces an abstraction called resilient distributed datasets (RDDs), which are read-only collections of objects partitioned across a set of machines. Spark's developers claim that it can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. Overall, the results show better processing time for Spark in anonymization operations. However, in some limited cases we prefer the older MapReduce framework, namely when cluster resources are limited and the network is not congested.
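For readers unfamiliar with k-anonymity, the property can be illustrated with a minimal sketch. It is written in plain Python rather than Spark, and it is not the sensitivity-based method evaluated in the paper. The function names `is_k_anonymous` and `generalize_age`, the column names, and the sample records are all hypothetical, chosen only for illustration. A table is k-anonymous when every combination of quasi-identifier values is shared by at least k records; generalization (here, replacing exact ages with ranges) is one common way to reach that state.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range of the given width (hypothetical rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical sample data: "age" and "zip" are treated as quasi-identifiers,
# "diagnosis" as the sensitive attribute.
records = [
    {"age": 23, "zip": "2000", "diagnosis": "flu"},
    {"age": 27, "zip": "2000", "diagnosis": "asthma"},
    {"age": 31, "zip": "2007", "diagnosis": "flu"},
    {"age": 38, "zip": "2007", "diagnosis": "diabetes"},
]

generalized = [generalize_age(r) for r in records]

raw_ok = is_k_anonymous(records, ["age", "zip"], k=2)
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)
```

In this sketch the raw records are not 2-anonymous because every (age, zip) pair is unique, while the generalized records are, since each age range and zip pair covers two records. In a distributed setting the same check would presumably be expressed as a grouped count over the quasi-identifier columns, which is the kind of operation Spark SQL is designed for.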
