Avoiding Slow Running Nodes in Distributed Systems

In distributed systems such as Hadoop, work is split into tasks that are executed in parallel on the nodes of a cluster. Stragglers, nodes that run 6–8 times slower than the median node, can degrade overall cluster performance by increasing job completion time. Existing solutions mainly take reactive measures after stragglers have been detected, which extends job completion time and wastes resources. More recently, proactive straggler avoidance techniques have applied machine learning methods to improve task scheduling. This paper proposes a prognostic system that proactively avoids stragglers using predictive models. It has two stages: (1) building a prediction model, using distributed machine learning, that identifies straggler nodes before tasks are allocated, and (2) guiding the scheduler to assign tasks efficiently. This avoids or minimizes stragglers and leads to smarter scheduling. Compared with the default Hadoop scheduler, the proposed solution shows a significant improvement.
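The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Node` features, the hand-picked weights in `straggler_risk`, the `threshold` value, and the function names are all hypothetical. In the paper, stage 1 is a model trained with distributed machine learning, for which the weighted sum here is only a stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_load: float  # fraction of CPU in use, 0.0 to 1.0 (hypothetical feature)
    mem_used: float  # fraction of memory in use, 0.0 to 1.0 (hypothetical feature)
    assigned: list = field(default_factory=list)


def straggler_risk(node):
    """Stage 1 stand-in: score how likely a node is to become a straggler."""
    # Hypothetical fixed weights; a trained model would replace this line.
    return 0.6 * node.cpu_load + 0.4 * node.mem_used


def assign_tasks(tasks, nodes, threshold=0.7):
    """Stage 2: place tasks only on nodes predicted not to straggle."""
    safe = [n for n in nodes if straggler_risk(n) < threshold]
    # If every node looks risky, fall back to all nodes rather than stall.
    candidates = safe or list(nodes)
    for task in tasks:
        # Prefer the node with the fewest tasks; break ties by lower risk.
        target = min(candidates, key=lambda n: (len(n.assigned), straggler_risk(n)))
        target.assigned.append(task)
    return {n.name: list(n.assigned) for n in nodes}


nodes = [Node("n1", 0.2, 0.3), Node("n2", 0.95, 0.9), Node("n3", 0.4, 0.5)]
plan = assign_tasks(["t1", "t2", "t3", "t4"], nodes)
print(plan)
```

In this example the heavily loaded node `n2` scores above the threshold and receives no tasks, so the four tasks are spread over `n1` and `n3`. The prediction happens before allocation, which is the proactive behaviour the abstract contrasts with reactive straggler handling.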
