Benchmarking MapReduce implementations under different application scenarios

Abstract. The MapReduce paradigm provides a scalable model for large-scale, data-intensive computing and its associated fault tolerance. Data volumes generated and processed by scientific applications are growing rapidly. Several MapReduce implementations, with varying degrees of conformance to the key tenets of the model, are available today. Each of these implementations is optimized for specific features. To make the right decisions, HPC application and middleware developers must therefore understand the complex dependencies between MapReduce features and their applications. We present a set of benchmarks for quantifying, comparing, and contrasting the performance of MapReduce implementations under a wide range of representative use cases. To demonstrate the utility of the benchmarks and to provide a snapshot of the current implementation landscape, we report the performance of three MapReduce implementations and draw conclusions about their current performance characteristics. The three implementations chosen for evaluation are Hadoop, which is widely used; Twister, which has been widely discussed in the literature in the context of scientific applications; and LEMO-MR, which is our own implementation.
