Paper Information - Data Mining Based Root-Cause Analysis of Performance Bottleneck for Big Data Workload

Data Mining Based Root-Cause Analysis of Performance Bottleneck for Big Data Workload

Straggler tasks are commonly considered the major bottleneck in parallel data processing. Previous work mainly focuses on coarse-grained straggler detection and optimization, such as speculative scheduling, while fine-grained root-cause analysis of straggler tasks is rarely considered. In addition, existing work depends largely on empirical analysis, which offers little useful guidance for performance optimization. In this paper, we propose a new methodology for fine-grained straggler root-cause analysis using machine learning. We collect raw metrics from the Spark event log and a hardware sampling tool, and refine them into high-level metrics for model learning. We then perform root-cause analysis of stragglers with a CART decision tree, applying a customized pruning method to improve analysis accuracy. From the analysis, we derive several new findings beyond the well-known causes of stragglers. Our work provides a new perspective on identifying and understanding inefficiency in parallel data processing programs by applying machine learning techniques to fine-grained root-cause analysis of straggler tasks.
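As an illustration of the kind of analysis the abstract describes, the following minimal Python sketch labels stragglers and finds the single CART-style split (by Gini impurity) that best separates them. It is not the authors' implementation. The 1.5x-median straggler rule, the metric names `gc_time_ratio` and `shuffle_read_mb`, and all sample values are assumptions made for this example; the abstract does not specify them, and the customized pruning step is not modeled here.

```python
from statistics import median


def label_stragglers(durations, threshold=1.5):
    """Mark a task as a straggler if it runs longer than threshold x the median.

    The 1.5x-median rule is an assumption for this sketch, not the paper's definition.
    """
    cutoff = threshold * median(durations)
    return [d > cutoff for d in durations]


def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return (feature, threshold, weighted_gini) for the best single split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, t, score)
    return best


# Hypothetical per-task data: durations in seconds plus two refined metrics.
durations = [10, 11, 9, 10, 30, 12, 28, 10]
gc_ratios = [0.05, 0.06, 0.04, 0.05, 0.40, 0.07, 0.35, 0.05]
shuffle_mb = [100, 120, 90, 300, 110, 95, 105, 100]

rows = [
    {"gc_time_ratio": g, "shuffle_read_mb": s}
    for g, s in zip(gc_ratios, shuffle_mb)
]
labels = label_stragglers(durations)
feature, split_value, impurity = best_split(rows, labels)
print(feature, split_value, impurity)
```

In this made-up data the two slow tasks are the ones with a high garbage-collection ratio, so the split lands on `gc_time_ratio`. A full CART tree would apply the same split search recursively to each child node and then prune the result.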

Wei Li | Yunchun Li | Hailong Yang | Honggang Zhou | Weichen Qi

[1] Jie Xu, et al. Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters, 2019, IEEE Transactions on Services Computing.

[2] Randy H. Katz, et al. Improving MapReduce Performance in Heterogeneous Environments, 2008, OSDI.

[3] Amin Vahdat, et al. Hedera: Dynamic Flow Scheduling for Data Center Networks, 2010, NSDI.

[4] Albert G. Greenberg, et al. Reining in the Outliers in Map-Reduce Clusters using Mantri, 2010, OSDI.

[5] Magdalena Balazinska, et al. SkewTune: mitigating skew in MapReduce applications, 2012, SIGMOD Conference.

[6] Hitesh Ballani, et al. Towards predictable datacenter networks, 2011, SIGCOMM.

[7] Jie Huang, et al. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[8] Lei Zhang, et al. Review of Hadoop performance optimization, 2016, 2016 2nd IEEE International Conference on Computer and Communications (ICCC).

[9] Chengzhong Xu, et al. Performance Modeling for Spark Using SVM, 2016, 2016 7th International Conference on Cloud Computing and Big Data (CCBD).

[10] Rajeev Gandhi, et al. Visual, Log-Based Causal Tracing for Performance Debugging of MapReduce Systems, 2010, 2010 IEEE 30th International Conference on Distributed Computing Systems.

[11] Zheng Hu, et al. Learning-Based Characterizing and Modeling Performance Bottlenecks of Big Data Workloads, 2016, 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[12] Scott Shenker, et al. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, 2014, SoCC.

[13] Randy H. Katz, et al. Heterogeneity and dynamicity of clouds at scale: Google trace analysis, 2012, SoCC '12.

[14] Zhen Xiao, et al. Improving MapReduce Performance Using Smart Speculative Execution Strategy, 2014, IEEE Transactions on Computers.

[15] Chen Wang, et al. Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics, 2015, Proc. VLDB Endow.

[16] Weisong Shi, et al. Workload characterization on a production Hadoop cluster: A case study on Taobao, 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[17] Scott Shenker, et al. Making Sense of Performance in Data Analytics Frameworks, 2015, NSDI.

[18] Eshcar Hillel, et al. Predicting Execution Bottlenecks in Map-Reduce Clusters, 2012, HotCloud.

[19] Scott Shenker, et al. Effective Straggler Mitigation: Attack of the Clones, 2013, NSDI.

[20] Jianfeng Zhan, et al. Characterization and architectural implications of big data workloads, 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).