Benchmarking MapReduce Implementations for Application Usage Scenarios

The MapReduce paradigm provides a scalable model for large-scale, data-intensive computing with built-in fault tolerance. With data production increasing daily due to ever-growing application needs, scientific endeavors, and consumption, the MapReduce model and its implementations need to be further evaluated, improved, and strengthened. Several MapReduce frameworks are available today, each conforming to the key tenets of the model to a different degree and each optimized for specific features. HPC application and middleware developers must therefore understand the complex dependencies between MapReduce features and their applications. We present a standard benchmark suite for quantifying, comparing, and contrasting the performance of MapReduce platforms under a wide range of representative use cases. We report the performance of three MapReduce implementations on the benchmarks and draw conclusions about their current performance characteristics. The three platforms we chose for evaluation are the widely used Apache Hadoop implementation; Twister, which has been discussed in the literature; and LEMO-MR, our own implementation. Our performance analysis also sheds light on the design decisions available to future implementations and allows Grid researchers to choose the MapReduce implementation that best suits their application's needs.
