ImRP: A Predictive Partition Method for Data Skew Alleviation in Spark Streaming Environment

Abstract: Spark Streaming is an extension of the core Spark engine that enables scalable, high-throughput, fault-tolerant processing of live data streams. It treats a stream as a series of deterministic batches and handles each batch as a regular job. However, for the stream job responsible for a batch, data skew (i.e., an imbalance in the amount of data allocated to each reduce task) can significantly degrade performance because the load is unevenly distributed. In this paper, we propose an improved range partitioner (ImRP) to alleviate reduce-side skew for stream jobs in Spark Streaming. Unlike previous work, ImRP does not require any pre-run sampling of the input data. Instead, it generates the data partition scheme from the intermediate data distribution observed while processing previous batches, using an EWMA (Exponentially Weighted Moving Average) model to predict the distribution of the next batch. To reduce the skew, ImRP introduces a novel method for calculating the partition borders optimally, together with a mechanism for splitting the key clusters that fall on a border when the semantics of the shuffle operator permit it. In addition, ImRP takes the integrated partition size and the heterogeneity of the computing environment into account when balancing the load among reduce tasks. We implement ImRP in Spark-3.0 and evaluate its performance on four representative benchmarks: wordCount, sort, pageRank, and LDA. The results show that, by mitigating data skew, ImRP substantially decreases the execution time of stream jobs compared with other partition strategies, especially when the input batch is heavily skewed.
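The general idea of predicting the next batch's key distribution with EWMA and deriving range borders from it can be illustrated with a minimal Python sketch. This is not the ImRP implementation: the function names (`ewma_update`, `range_borders`, `partition_of`), the smoothing factor `alpha = 0.5`, and the simple greedy border rule are all assumptions made for illustration. The abstract does not give the optimal border calculation, the border-cluster splitting, or the heterogeneity-aware balancing, so none of those are reproduced here.

```python
import bisect
from typing import Dict, List


def ewma_update(prev: Dict[str, float],
                observed: Dict[str, float],
                alpha: float = 0.5) -> Dict[str, float]:
    """Blend the previous estimate with the counts observed in the latest batch.

    estimate_t = alpha * observed_t + (1 - alpha) * estimate_{t-1}
    Keys missing on either side are treated as having count 0.
    alpha is a hypothetical smoothing factor, not a value from the paper.
    """
    keys = set(prev) | set(observed)
    return {k: alpha * observed.get(k, 0.0) + (1 - alpha) * prev.get(k, 0.0)
            for k in keys}


def range_borders(estimate: Dict[str, float], num_partitions: int) -> List[str]:
    """Pick up to num_partitions - 1 border keys over the sorted key space.

    Greedy rule (an assumption, not the paper's optimal method): close the
    current partition once the cumulative estimated load reaches the next
    multiple of total / num_partitions.
    """
    total = sum(estimate.values())
    if total == 0 or num_partitions < 2:
        return []
    target = total / num_partitions
    borders: List[str] = []
    running = 0.0
    for key in sorted(estimate):
        running += estimate[key]
        if (len(borders) < num_partitions - 1
                and running >= target * (len(borders) + 1)):
            borders.append(key)
    return borders


def partition_of(key: str, borders: List[str]) -> int:
    """Return the partition index; a key equal to a border stays in the lower partition."""
    return bisect.bisect_left(borders, key)


# Hypothetical per-key record counts from two consecutive batches.
previous_estimate = {"a": 10.0, "b": 2.0, "c": 2.0, "d": 2.0}
latest_batch = {"a": 6.0, "b": 4.0, "c": 4.0, "e": 4.0}

estimate = ewma_update(previous_estimate, latest_batch, alpha=0.5)
borders = range_borders(estimate, num_partitions=3)
assignment = {k: partition_of(k, borders) for k in sorted(estimate)}
```

In this sketch the estimate is refreshed after every batch and the borders are recomputed before the next one, so no separate sampling pass over the input is needed. The greedy rule never splits a key across partitions, so a single hot key can still overload one partition.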
