Security and privacy aspects in MapReduce on clouds: A survey

MapReduce is a programming system for distributed processing of large-scale data in an efficient and fault-tolerant manner on a private, public, or hybrid cloud. MapReduce is used extensively around the world every day as an efficient distributed computation tool for a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and analysis of social networks. Security and privacy of data and MapReduce computations are essential concerns when a MapReduce computation is executed in public or hybrid clouds. To execute a MapReduce job in public or hybrid clouds, authentication of mappers and reducers, confidentiality of data and computations, integrity of data and computations, and correctness and freshness of the outputs are required. Satisfying these requirements shields the operation from several types of attacks on data and MapReduce computations. In this paper, we investigate and discuss security and privacy challenges and requirements in the scope of MapReduce, considering a variety of adversarial capabilities and characteristics. We also provide a review of existing security and privacy protocols for MapReduce and discuss their overhead issues.
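To make the integrity requirement concrete, the following is a minimal sketch, not a protocol taken from the survey. It assumes a word-count job in which each mapper attaches an HMAC-SHA256 tag to every intermediate key-value pair and the reducer verifies the tag before aggregating. The names `SECRET_KEY`, `tag`, `map_phase`, and `reduce_phase` are hypothetical, and the sketch covers only integrity of intermediate data, not confidentiality or output freshness.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key shared by trusted mappers and reducers (assumption for this sketch).
SECRET_KEY = b"demo-key-not-for-production"


def tag(key, value):
    """Return an HMAC-SHA256 tag over one intermediate key-value pair."""
    message = f"{key}\x00{value}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def map_phase(document):
    """Mapper: emit (word, 1, tag) for every word in the input split."""
    return [(word, 1, tag(word, 1)) for word in document.split()]


def reduce_phase(tagged_pairs):
    """Reducer: verify each tag, then sum the counts per word."""
    counts = defaultdict(int)
    for word, count, mac in tagged_pairs:
        # Constant-time comparison of the received tag with a recomputed one.
        if not hmac.compare_digest(mac, tag(word, count)):
            raise ValueError(f"integrity check failed for {word!r}")
        counts[word] += count
    return dict(counts)


# Two input splits, processed by two (simulated) mappers.
splits = ["a b a", "b c"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(intermediate)
```

If an adversary alters a count in transit without knowing the key, `reduce_phase` raises `ValueError` instead of silently producing a wrong total. A shared symmetric key is the simplest choice for a sketch; a real deployment would also need key distribution and authentication of the workers themselves.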
