Towards Provenance-Based Anomaly Detection in MapReduce

MapReduce enables parallel and distributed processing of vast amounts of data on a cluster of machines. However, this computing paradigm is exposed to threats from malicious or cheating nodes and from compromised user-submitted code, either of which could tamper with data and computation, because users retain little control once the computation is carried out in a distributed fashion. In this paper, we focus on the analysis and detection of anomalies during MapReduce computation. To this end, we develop a computational provenance system that captures provenance data about MapReduce computation within the MapReduce framework in Hadoop. In particular, we identify a set of invariants over aggregated provenance information, which are later analyzed to uncover anomalies indicating possible tampering with data and computation. We conduct a series of experiments to demonstrate the efficiency and effectiveness of the proposed provenance system.
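To make the idea of checking an invariant over aggregated provenance concrete, the following is a minimal sketch, not the system described in this paper. It assumes a hypothetical per-task record (`TaskProvenance`) and one illustrative invariant: with no combiner configured, the total number of records emitted by map tasks should equal the total number of records consumed by reduce tasks. The record fields, the function name, and the choice of invariant are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TaskProvenance:
    """Hypothetical per-task provenance record (not the paper's schema)."""
    task_id: str
    kind: str          # "map" or "reduce"
    records_in: int    # records read by the task
    records_out: int   # records emitted by the task


def check_shuffle_invariant(records: Iterable[TaskProvenance]) -> List[str]:
    """Aggregate provenance across tasks and check one illustrative invariant.

    Assumed invariant (no combiner): sum of map outputs == sum of reduce inputs.
    Returns a list of violation messages; an empty list means the invariant holds.
    """
    records = list(records)
    map_out = sum(r.records_out for r in records if r.kind == "map")
    reduce_in = sum(r.records_in for r in records if r.kind == "reduce")

    violations = []
    if map_out != reduce_in:
        violations.append(
            f"map output records ({map_out}) != reduce input records ({reduce_in})"
        )
    return violations


# Example job: the second run has a reducer reporting fewer input records
# than the mappers emitted, which the check flags as an anomaly.
honest_job = [
    TaskProvenance("m0", "map", 100, 250),
    TaskProvenance("m1", "map", 120, 300),
    TaskProvenance("r0", "reduce", 550, 40),
]
tampered_job = [
    TaskProvenance("m0", "map", 100, 250),
    TaskProvenance("m1", "map", 120, 300),
    TaskProvenance("r0", "reduce", 500, 40),
]
```

Calling `check_shuffle_invariant(honest_job)` returns an empty list, while `check_shuffle_invariant(tampered_job)` returns one violation message. A violation only indicates a possible anomaly; it does not by itself identify which node misbehaved.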
