A Modeling Language for MapReduce Programming in a Storage System Perspective

MapReduce is a powerful programming model for distributed data analysis. It runs on big data storage systems and processes data in parallel. Formal methods are an appropriate way to ensure the correctness of MapReduce programs, and they first require a formal model of MapReduce. In this paper, we propose a modeling language for establishing a formal model of the MapReduce framework. Unlike other approaches, our language describes data processing in MapReduce programs from the perspective of the underlying files and blocks, so that the details of data processing can be clearly demonstrated. The language is based on our previous work, a language for describing the management of massive data storage systems, and extends it in two respects: refinement of block content data and support for concurrency. Based on our language, the features of the MapReduce programming model can be discussed.
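
To make the file-and-block view concrete, the following is a minimal Python sketch of a word count in which a file is treated as a list of blocks and each block is mapped independently. It is not the modeling language proposed in the paper; the names `split_into_blocks`, `map_block`, `shuffle`, `reduce_group`, and `word_count`, as well as the representation of a block as a string of words, are assumptions made only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(content, block_size):
    """Hypothetical storage view: cut a file's content into blocks of
    at most `block_size` words each."""
    words = content.split()
    return [
        " ".join(words[i:i + block_size])
        for i in range(0, len(words), block_size)
    ]


def map_block(block):
    """Map phase: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped):
    """Group the intermediate pairs from all blocks by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def word_count(file_blocks):
    """Run the map phase on the blocks concurrently, then shuffle and reduce."""
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_block, file_blocks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


blocks = split_into_blocks("a b a c b a", 2)
counts = word_count(blocks)
print(blocks)
print(counts)
```

In this sketch the blocks are the unit of parallelism: each call to `map_block` sees only one block, which loosely mirrors how the abstract describes processing in terms of the underlying files and blocks. A thread pool stands in for the cluster here purely for simplicity.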
