Configuring a MapReduce Framework for Performance-Heterogeneous Clusters

When data centers employ the common and economical practice of upgrading subsets of nodes incrementally, rather than replacing or upgrading all nodes at once, they end up with clusters whose nodes have non-uniform processing capability, which we also call performance-heterogeneity. Popular frameworks supporting the effective MapReduce programming model for Big Data applications do not flexibly adapt to these environments. Instead, existing MapReduce frameworks, including Hadoop, typically divide data evenly among worker nodes, thereby inducing the well-known problem of stragglers on slower nodes. Our alternative MapReduce framework, called MARLA, divides each worker's labor into sub-tasks, delays the binding of data to worker processes, and thereby enables applications to run faster in performance-heterogeneous environments. This approach does introduce overhead, however. We explore and characterize the opportunity for performance gains, and identify when the benefits outweigh the costs. Our results suggest that frameworks should support finer-grained sub-tasking and dynamic data partitioning when running on some performance-heterogeneous clusters. Blindly taking this approach in homogeneous clusters can slow applications down. Our study further suggests the opportunity for cluster managers to build performance-heterogeneous clusters by design, if they also run MapReduce frameworks that can exploit them.
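The trade-off described above can be illustrated with a toy simulation. The sketch below is not MARLA's implementation. It assumes each node has a fixed relative speed, that work divides into equal-sized sub-tasks, and that every sub-task costs a fixed scheduling overhead. The function names, the `overhead` parameter, and the node speeds are hypothetical and chosen only for illustration. It compares an even static split against sub-tasks that idle nodes pull on demand, which is one simple way to model late binding of data to workers.

```python
import heapq


def static_makespan(total_work, speeds):
    """Even split: every node gets the same share, so the slowest node sets the finish time."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(total_work, speeds, tasks_per_node, overhead=0.0):
    """Late binding: work is cut into sub-tasks and the next idle node pulls the next one.

    `overhead` is an assumed fixed cost per sub-task (hypothetical parameter).
    """
    n_tasks = len(speeds) * tasks_per_node
    task_size = total_work / n_tasks
    # Min-heap of (time at which the node becomes idle, node index).
    idle_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, i = heapq.heappop(idle_at)
        done = t + overhead + task_size / speeds[i]
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, i))
    return finish


if __name__ == "__main__":
    hetero = [1.0, 1.0, 2.0, 2.0]  # two old nodes, two upgraded nodes (assumed speeds)
    homo = [1.0, 1.0, 1.0, 1.0]
    print("heterogeneous, static :", static_makespan(120, hetero))
    print("heterogeneous, dynamic:", dynamic_makespan(120, hetero, 4))
    print("homogeneous, static   :", static_makespan(120, homo))
    print("homogeneous, dynamic  :", dynamic_makespan(120, homo, 4, overhead=0.5))
```

Under these assumptions, the heterogeneous cluster finishes in 22.5 time units with pulled sub-tasks versus 30.0 with the even split, because the faster nodes absorb more of the work. In the homogeneous cluster with a per-sub-task overhead of 0.5, the same approach takes 32.0 versus 30.0, which mirrors the caution that sub-tasking can slow applications down when nodes are uniform.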
