Dynamic Resource Allocation for MapReduce with Partitioning Skew

MapReduce has become a prevalent programming model for building data processing applications in the cloud. Although widely used, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. Existing solutions follow a similar principle: they repartition the workload among reduce tasks. However, these approaches often incur high performance overhead due to partition size prediction and repartitioning. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Instead of repartitioning the workload among reduce tasks, we cope with the partitioning skew problem by controlling the amount of resources allocated to each reduce task. Our approach completely eliminates the repartitioning overhead, yet is simple to implement. Experiments using both real and synthetic workloads running on a 21-node Hadoop cluster demonstrate that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving the job completion time by up to a factor of 2.29 over native Hadoop YARN. Compared to the state-of-the-art solution, DREAMS can improve the job completion time by a factor of 1.65.
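To make the core idea concrete, the following is a minimal sketch of sizing each reduce task's container according to its partition size rather than moving data between reducers. It is not the DREAMS allocation algorithm, which the abstract does not describe. The function name `allocate_memory`, the proportional-to-mean scaling rule, and the parameters `min_mb`, `max_mb`, and `step_mb` are all assumptions made for illustration only.

```python
import math


def allocate_memory(partition_sizes_mb, min_mb=1024, max_mb=8192, step_mb=512):
    """Hypothetical helper: pick a container memory size per reduce task.

    Assumption (not from the paper): a reducer whose partition is k times
    the mean partition size receives k times the baseline memory, rounded
    up to a multiple of step_mb and clamped to [min_mb, max_mb].
    """
    if not partition_sizes_mb:
        return []
    mean_size = sum(partition_sizes_mb) / len(partition_sizes_mb)
    allocations = []
    for size in partition_sizes_mb:
        # Ratio of this partition to the average partition.
        ratio = size / mean_size if mean_size > 0 else 1.0
        # Scale the baseline, then round up to the allocation granularity.
        rounded = math.ceil(min_mb * ratio / step_mb) * step_mb
        # Keep the request within the limits a cluster would accept.
        allocations.append(max(min_mb, min(max_mb, rounded)))
    return allocations


if __name__ == "__main__":
    # Three small partitions and one skewed partition (sizes in MB).
    print(allocate_memory([100, 100, 100, 700]))
```

In this sketch, partitions are never split or moved; only the resource request of the skewed reducer grows. A real system would also need to account for CPU and for how task running time actually responds to extra resources, which this simple rule ignores.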
