IntegrityMR: Integrity assurance framework for big data analytics and management applications

Big data analytics and knowledge management are attracting growing attention with the emergence of cloud computing and big data computing models such as MapReduce. However, large-scale adoption of MapReduce applications on public clouds is hindered by a lack of trust in the participating virtual machines deployed there. In this paper, we extend the existing hybrid cloud MapReduce architecture to multiple public clouds. Based on this architecture, we propose IntegrityMR, an integrity assurance framework for big data analytics and management applications. We explore result integrity check techniques at two alternative software layers: the MapReduce task layer and the application layer. We design and implement the system at both layers based on Apache Hadoop MapReduce and Pig Latin, and perform a series of experiments with popular big data analytics and management applications, such as Apache Mahout and Pig, on commercial public clouds (Amazon EC2 and Microsoft Azure) and in a local cluster environment. The experimental results for the task layer approach show high integrity (98% with a credit threshold of 5) with non-negligible performance overhead (18% to 82% extra running time compared with the original MapReduce). The experimental results for the application layer approach show better performance than the task layer approach (less than 35% extra running time compared with the original MapReduce).
