An Adaptive Partition Method for Handling Skew in Spark Applications

In parallel computing frameworks such as Hadoop and Spark, data skew is a common problem that degrades performance: it prolongs the overall execution time and leaves resources idle. The underlying cause is partition imbalance, which leads to large differences in the amount of data processed by each reduce task. This paper proposes a key reassigning and splitting partition algorithm (SKRSP) to handle skew, which considers both the partition balance of the intermediate data and the partition balance after the shuffle operation. We design two partition algorithms for different applications: a range-based key splitting partition method (KSRP) for sort operations and a hash-based key reassigning partition method (KRHP) for all other operations. We implement SKRSP in Spark 2.2.0 and evaluate its performance on three benchmarks that exhibit significant data skew: Sort, Join, and PageRank. The experimental results show that our algorithm not only achieves a better partition balance but also effectively reduces the execution time of reduce tasks.
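To make the two ideas concrete, the following is a minimal Python sketch of key reassignment and key splitting, not the paper's KRHP or KSRP algorithms themselves. It assumes per-key record counts are already known (for example, estimated by sampling), uses integer keys so that the hash is deterministic, and substitutes a simple greedy heaviest-key-first heuristic for reassignment. All function names (`hash_partition_loads`, `greedy_reassign`, `range_split`) are hypothetical and chosen only for illustration.

```python
import heapq


def hash_partition_loads(key_counts, num_partitions):
    """Records per partition under plain hash partitioning (key % n)."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[key % num_partitions] += count
    return loads


def greedy_reassign(key_counts, num_partitions):
    """Assign whole keys, heaviest first, to the currently lightest partition.

    Assumption: a stand-in heuristic for hash-based key reassignment,
    not the algorithm from the paper.
    """
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        load, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (load + count, p))
    loads = [0] * num_partitions
    for key, p in assignment.items():
        loads[p] += key_counts[key]
    return assignment, loads


def range_split(sorted_key_counts, num_partitions):
    """Fill partitions in key order, splitting a hot key across neighbours.

    Assumption: a stand-in for range-based key splitting. Key order is
    preserved across partitions, so a global sort remains valid.
    """
    total = sum(count for _, count in sorted_key_counts)
    parts = [[] for _ in range(num_partitions)]
    p = 0
    filled = 0  # records placed so far
    for key, count in sorted_key_counts:
        remaining = count
        while remaining > 0:
            # Cumulative upper bound on records for partitions 0..p.
            boundary = total * (p + 1) // num_partitions
            room = boundary - filled
            if room == 0 and p < num_partitions - 1:
                p += 1
                continue
            take = min(remaining, room)
            parts[p].append((key, take))
            filled += take
            remaining -= take
    return parts


# Skewed example: keys 0 and 4 are hot and collide under key % 4.
counts = {0: 40, 4: 30, 1: 10, 2: 10, 3: 5, 5: 5}
hashed = hash_partition_loads(counts, 4)
_, reassigned = greedy_reassign(counts, 4)

# Sort-style example: one key holds 70% of the records.
split = range_split([(1, 70), (2, 10), (3, 10), (4, 10)], 4)
split_loads = [sum(c for _, c in part) for part in split]
```

In this toy input, reassignment lowers the heaviest partition's load because the two hot keys no longer share a partition, while splitting lets a single dominant key spread over several adjacent range partitions. Reassignment keeps every key on one partition, which operations such as join or aggregation require, whereas splitting is only safe when records with the same key may be processed in different partitions, as in a sort.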
