Online load balancing for MapReduce with skewed data input

MapReduce has emerged as a powerful tool for distributed and scalable processing of voluminous data. In this paper, we, for the first time, examine the problem of accommodating data skew in MapReduce with online operations. Different from earlier heuristics in the very late reduce stage or after seeing all the data, we address the skew from the beginning of data input, and make no assumption about a priori knowledge of the data distribution nor require synchronized operations. We examine the input in a continuous fashion and adaptively assign tasks with a load-balanced strategy. We show that the optimal strategy is a constrained version of online minimum makespan and, in the MapReduce context where pairs with identical keys must be scheduled to the same machine, there is an online algorithm with a provable 2-competitive ratio. We further suggest a sample-based enhancement, which, probabilistically, achieves a 3/2-competitive ratio with a bounded error.

[1]  Ronald L. Graham,et al.  Bounds on Multiprocessing Timing Anomalies , 1969, SIAM Journal of Applied Mathematics.

[2]  K. Mani Chandy,et al.  A comparison of list schedules for parallel processing systems , 1974, Commun. ACM.

[3]  Hironori Kasahara,et al.  Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing , 1984, IEEE Transactions on Computers.

[4]  Honesty C. Young,et al.  A Symmetric Fragment and Replicate Algorithm for Distributed Joins , 1993, IEEE Trans. Parallel Distributed Syst..

[5]  Per-Åke Larson,et al.  Eager Aggregation and Lazy Aggregation , 1995, VLDB.

[6]  Rudolf Fleischer,et al.  On‐line scheduling revisited , 2000 .

[7]  Rudolf Fleischer,et al.  Online Scheduling Revisited , 2000, ESA.

[8]  Ramaswamy Chandrasekaran,et al.  Improved Bounds for the Online Scheduling Problem , 2003, SIAM J. Comput..

[9]  Muthu Dayalan,et al.  MapReduce : Simplified Data Processing on Large Cluster , 2018 .

[10]  Éva Tardos,et al.  Algorithm design , 2005 .

[11]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[12]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[13]  Matthias Englert,et al.  The Power of Reordering for Online Minimum Makespan Scheduling , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[14]  Geoffrey C. Fox,et al.  MapReduce for Data Intensive Scientific Analyses , 2008, 2008 IEEE Fourth International Conference on eScience.

[15]  Jimmy J. Lin,et al.  The Curse of Zipf and Limits to Parallelization: An Look at the Stragglers Problem in MapReduce , 2009, LSDS-IR@SIGIR.

[16]  Geoffrey C. Fox,et al.  Twister: a runtime for iterative MapReduce , 2010, HPDC '10.

[17]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[18]  Hai Jin,et al.  LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[19]  Jignesh M. Patel,et al.  Energy management for MapReduce clusters , 2010, Proc. VLDB Endow..

[20]  Michael D. Ernst,et al.  HaLoop , 2010, Proc. VLDB Endow..

[21]  Magdalena Balazinska,et al.  ParaTimer: a progress indicator for MapReduce DAGs , 2010, SIGMOD Conference.

[22]  Christoforos E. Kozyrakis,et al.  On the energy (in)efficiency of Hadoop clusters , 2010, OPSR.

[23]  Magdalena Balazinska,et al.  A Study of Skew in MapReduce Applications , 2011 .

[24]  Prashant J. Shenoy,et al.  A platform for scalable one-pass analytics using MapReduce , 2011, SIGMOD '11.

[25]  Murali S. Kodialam,et al.  Scheduling in mapreduce-like systems for fast completion time , 2011, 2011 Proceedings IEEE INFOCOM.

[26]  D. DeWitt MapReduce: A major step backwards | The Database Column , 2011 .

[27]  Nikolaus Augsten,et al.  Handling Data Skew in MapReduce , 2011, CLOSER.

[28]  Garret Swart,et al.  Balancing reducer skew in MapReduce workloads using progressive sampling , 2012, SoCC '12.

[29]  Nikolaus Augsten,et al.  Load Balancing in MapReduce Based on Scalable Cardinality Estimates , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[30]  Murali S. Kodialam,et al.  Joint scheduling of processing and Shuffle phases in MapReduce systems , 2012, 2012 Proceedings IEEE INFOCOM.

[31]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[32]  Xiaoqiao Meng,et al.  Coupling task progress for MapReduce resource-aware scheduling , 2013, 2013 Proceedings IEEE INFOCOM.

[33]  Haiyang Wang,et al.  On interference-aware provisioning for cloud-based big data processing , 2013, 2013 IEEE/ACM 21st International Symposium on Quality of Service (IWQoS).

[34]  Michael Stonebraker,et al.  MapReduce: A major step backwards , 2014 .

[35]  Lisa Werner,et al.  Probability And Statistics For Engineering And The Sciences , 2016 .