Flame-MR: An event-driven architecture for MapReduce applications

Many organizations now analyze their data with the MapReduce paradigm, most of them using the popular Apache Hadoop framework. As the volume of data handled by MapReduce applications steadily increases, so does the need to improve Hadoop performance. Existing modifications of Hadoop (e.g., Mellanox Unstructured Data Accelerator) attempt to improve performance by changing some of its underlying subsystems. However, they cannot always cope with all of its performance bottlenecks, or they hinder its portability. Furthermore, new frameworks like Apache Spark or DataMPI can achieve good performance improvements, but they do not maintain compatibility with existing MapReduce applications. This paper proposes Flame-MR, a new event-driven MapReduce architecture that increases Hadoop performance by avoiding memory copies and pipelining data movements, without modifying the source code of the applications. The performance evaluation on two representative systems (an HPC cluster and a public cloud platform) shows significant performance gains, reducing execution time by up to 54% on the Amazon EC2 cloud.

Description of Flame-MR, a new MapReduce framework that improves the performance and resource efficiency of Hadoop.
Flame-MR keeps Hadoop API compatibility in order to avoid source code modifications.
Performance comparison with Hadoop-based frameworks using representative workloads on an HPC cluster and a cloud platform.
Flame-MR reduces Hadoop execution times by up to 34% for the selected micro-benchmarks and 54% for the application benchmarks.
