Background
The expansion of the Internet and its use for online activities such as e-commerce and social networking are producing large volumes of transactional data. These huge data volumes facilitate the analysis of global trends and the discovery of interesting patterns used for a variety of decision-making purposes. However, the analytics involved in these processes can expose sensitive information present in the datasets, which is a serious privacy threat. To overcome this challenge, a few sequential sanitization heuristics were used in the past, when data volumes were small enough for such heuristics to be tractable; on current volumes they often incur prohibitively high execution times. This new challenge of scalability paves the way for experimenting with Big Data approaches such as the MapReduce framework. We combine the MapReduce framework with adopted heuristics to overcome the scalability challenge while providing the much-needed privacy preservation, yielding efficient analytic results within bounded execution times.

Methods
MapReduce is a parallel programming framework [16] that provides the opportunity to leverage largely distributed resources for Big Data analytics, allowing the resources of a distributed system to be utilized in parallel. Its simplicity and high fault tolerance are the key features that make MapReduce a promising framework. We therefore propose a two-phase MapReduce version of the adopted heuristics. The MapReduce framework divides the whole dataset into n chunks, D = d1 ∪ d2 ∪ ... ∪ dn, and distributes them over n computing nodes to achieve parallelization.
The first phase of the MapReduce job runs on each data chunk to generate intermediate results, which are then sorted and merged in the second phase to produce the final sanitized dataset.

Results
We conducted three sets of experiments, each with five scenarios corresponding to different cluster sizes, n = 1, 2, 3, 4, 5, where n is the number of computing nodes, and compared the approaches on both real and synthetically generated large datasets. For varying data sizes and varying numbers of computing nodes, the sanitization time required by the MapReduce-based algorithm on a dataset of a given size is much lower than that of the traditional sequential approach, and scalability improves further as more computing nodes are added. A further set of experiments explores how sanitization time changes with the amount of sensitive content present in a dataset; evaluated across cluster sizes from 1 to 5 nodes, the execution time of our approach remains much lower than that of the traditional schemes. Moreover, no hiding failures or artifactual patterns were observed during the experiments, and in terms of misses cost the MapReduce version performs the same as the traditional approaches.

Conclusion
Traditional data-hiding approaches, primarily MaxFIA and SWA, are unable to tackle large, voluminous data. To overcome this new challenge of scalability, we implemented these basic heuristics with a Big Data approach, namely the MapReduce framework. Quantitative evaluations show that the fusion of the MapReduce framework with these adopted heuristics fulfills its obligation of being scalable and many-fold faster, yielding efficient analytic results.
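The two-phase scheme described in Methods can be illustrated with a minimal sketch. This is not the paper's implementation: the sensitive itemset, the victim-item choice (here simply the lexicographically smallest item, whereas a MaxFIA-style heuristic would pick by frequency), and all names are illustrative assumptions; a real deployment would run the two phases as Hadoop map and reduce tasks over distributed chunks.

```python
# Illustrative sketch of the two-phase MapReduce sanitization.
# SENSITIVE, the victim-item rule, and the toy dataset are assumptions,
# not the paper's actual heuristic or data.
SENSITIVE = [frozenset({"a", "b"})]  # hypothetical sensitive itemset to hide

def map_phase(chunk):
    """Phase 1: each node sanitizes its own data chunk, emitting
    (transaction_id, sanitized_items) intermediate pairs."""
    out = []
    for tid, items in chunk:
        items = set(items)
        for s in SENSITIVE:
            if s <= items:              # transaction supports a sensitive itemset
                items.discard(min(s))   # drop one victim item (simplified choice)
        out.append((tid, sorted(items)))
    return out

def reduce_phase(intermediate):
    """Phase 2: sort and merge intermediate results from all nodes
    into the final sanitized dataset."""
    merged = []
    for part in intermediate:
        merged.extend(part)
    return sorted(merged)               # ordered by transaction id

# Simulate n = 2 computing nodes, each holding one chunk d_i of D.
chunks = [
    [(1, ["a", "b", "c"]), (2, ["a", "c"])],
    [(3, ["a", "b"]), (4, ["b", "c"])],
]
sanitized = reduce_phase(map_phase(c) for c in chunks)
```

After the reduce phase, no transaction in `sanitized` still supports the sensitive itemset, while transactions that never supported it are left untouched.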
[1] Durga Toshniwal, et al., "Parallelization of association rule mining: Survey," 2015 International Conference on Computing, Communication and Security (ICCCS), 2015.
[2] Fang Liu, et al., "Privacy-Preserving Scanning of Big Content for Sensitive Data Exposure with MapReduce," CODASPY, 2015.
[3] Sanjay Ghemawat, et al., "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.
[4] Milind A. Bhandarkar, et al., "MapReduce programming with apache Hadoop," 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[5] Athanasios V. Vasilakos, et al., "Parallel Processing Systems for Big Data: A Survey," Proceedings of the IEEE, 2016.
[6] Elisa Bertino, et al., "Hiding Association Rules by Using Confidence and Support," Information Hiding, 2001.
[7] Xiaolei Dong, et al., "Security and privacy for storage and computation in cloud computing," Inf. Sci., 2014.
[8] Cheng Huang, et al., "EFPA: Efficient and flexible privacy-preserving mining of association rule in cloud," 2015 IEEE/CIC International Conference on Communications in China (ICCC), 2015.
[9] Vassilios S. Verykios, et al., "Disclosure limitation of sensitive rules," Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX'99), 1999.
[10] Ali Amiri, et al., "Dare to share: Protecting sensitive knowledge with data sanitization," Decis. Support Syst., 2007.
[11] Stanley Robson de Medeiros Oliveira, et al., "Privacy preserving frequent itemset mining," 2002.
[12] Elisa Bertino, et al., "Association rule hiding," IEEE Transactions on Knowledge and Data Engineering, 2004.
[13] Philip S. Yu, et al., "Top-down specialization for information and privacy preservation," 21st International Conference on Data Engineering (ICDE'05), 2005.
[14] Elisa Bertino, et al., "Privacy-Preserving Association Rule Mining in Cloud Computing," AsiaCCS, 2015.
[15] Athanasios V. Vasilakos, et al., "Two Schemes of Privacy-Preserving Trust Evaluation," Future Gener. Comput. Syst., 2016.
[16] Jin Li, et al., "Privacy-preserving data utilization in hybrid clouds," Future Gener. Comput. Syst., 2014.
[17] Jinjun Chen, et al., "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," IEEE Transactions on Parallel and Distributed Systems, 2014.
[18] Osmar R. Zaïane, et al., "Protecting sensitive knowledge by data sanitization," Third IEEE International Conference on Data Mining, 2003.
[19] Xuyun Zhang, et al., "Privacy Preservation over Big Data in Cloud Systems," 2014.