Optimized Speculative Execution Strategy for Different Workload Levels in Heterogeneous Spark Cluster

Spark is a big data processing framework based on MapReduce, whose computation model requires all tasks in all parent stages to complete before a new stage can start. Variability in machine service times and congested network connections, caused by partial or intermittent machine failures, therefore become a bottleneck for task execution in Spark. In this paper, we focus on the design of speculative execution schemes for heterogeneous Spark clusters from an optimization perspective under different loading conditions. First, we derive the load arrival rate threshold that separates the operating regimes. Second, for the lightly loaded case, we analyze and propose the speculative execution based on task-cloning algorithm (SETC), which reduces the application completion time by maximizing the overall system utility. Then, for the heavily loaded case, we propose the speculative execution based on straggler-detection algorithm (SESD), which aims to mitigate stragglers. Finally, we conduct experiments to verify the performance of SETC and SESD. Results show that our method is faster than Spark-Speculation, LATE, and SCA by 16.7%, 8.2%, and 11.7%, respectively. It also outperforms the baseline algorithms on other metrics such as cluster throughput.
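To make the notion of straggler detection concrete, the following minimal Python sketch flags slow tasks with a median-based rule in the style of the Spark-Speculation baseline. It is not the SESD algorithm, whose details are not given here. The function name `find_stragglers` and the values of `quantile` and `multiplier` are illustrative assumptions.

```python
from statistics import median


def find_stragglers(finished_durations, running_elapsed, total_tasks,
                    quantile=0.75, multiplier=1.5):
    """Return ids of running tasks that look like stragglers.

    Simplified median-based rule (an assumption for illustration, not SESD):
    once at least `quantile` of the stage's tasks have finished, a running
    task is flagged if its elapsed time exceeds `multiplier` times the
    median duration of the finished tasks.
    """
    # Too few completed tasks: the median is not yet a reliable reference.
    if total_tasks == 0 or len(finished_durations) < quantile * total_tasks:
        return []
    threshold = multiplier * median(finished_durations)
    # Sort the ids so the result is deterministic.
    return sorted(task_id for task_id, elapsed in running_elapsed.items()
                  if elapsed > threshold)


if __name__ == "__main__":
    # Three of four tasks are done; the last one has been running for 30 s.
    print(find_stragglers([10.0, 12.0, 11.0], {"t3": 30.0}, total_tasks=4))
```

A scheduler would launch a speculative copy for each flagged task. In a heterogeneous cluster a single stage-wide median can misjudge tasks on slower nodes, which is the kind of limitation that motivates more refined detection schemes.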
