A novel compression algorithm decision method for the Spark Shuffle process

With the wide adoption of the Spark big data platform, some problems have been exposed in practical applications, and one of the main problems is performance optimization. The Shuffle module is one of the core modules of Spark, and it is also an important module in other distributed big data computing frameworks. The design of the Shuffle module is a key factor that directly determines the performance of a big data computing framework. The main optimization parameters of the Shuffle process involve CPU utilization, I/O read/write rate, and network transmission rate, and any one of these factors can become the bottleneck during application execution. The network data transmission time, I/O read and write time, and CPU utilization are all closely related to the size of the data being processed. As a result, Spark provides compression configuration options and different compression algorithms for users to select. Different compression algorithms differ in compression speed and compression ratio, but most users keep the default configuration even though they run different applications, so the optimal configuration cannot be achieved. In order to achieve the optimal compression configuration for the Shuffle process, this paper proposes a cost optimization model for the Spark Shuffle process, which enables users to obtain the best compression configuration before application execution. The experimental results show that the prediction model for compression configuration has an accuracy of 58.3%, and the proposed cost optimization model can improve performance by 48.9%.
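For context, the compression options the abstract refers to are ordinary Spark configuration properties set before the job starts. The following is a minimal Scala sketch of how such a choice is applied; the specific codec value ("zstd") and application name are illustrative assumptions only, since the paper's model would select the codec per application.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: applying a shuffle compression choice before the application runs.
// "spark.shuffle.compress" toggles compression of shuffle map outputs;
// "spark.io.compression.codec" selects the algorithm (lz4 is the default,
// with lzf, snappy, and zstd as alternatives).
val conf = new SparkConf()
  .setAppName("shuffle-compression-demo")          // hypothetical application name
  .set("spark.shuffle.compress", "true")
  .set("spark.io.compression.codec", "zstd")        // illustrative codec choice

val spark = SparkSession.builder().config(conf).getOrCreate()
```

A decision method such as the one proposed here would output the codec value to plug into this configuration, trading compression ratio against CPU and I/O cost for the given workload.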
