Avoiding Slow-Running Nodes in Distributed Systems

In distributed systems such as Hadoop, work is split into tasks that execute in parallel on the nodes of a cluster. Stragglers, nodes that run 6–8 times slower than the median node, can degrade overall cluster performance by increasing job completion time. Existing solutions concentrate mainly on reactive measures taken after stragglers are detected, which extends job completion time and wastes resources. More recently, proactive straggler-avoidance techniques have applied machine learning to improve task scheduling. This paper proposes a prognostic system that proactively avoids stragglers using predictive models. It has two stages: (1) building a prediction model, using distributed machine learning, that identifies straggler nodes before tasks are allocated, and (2) guiding the scheduler to assign tasks efficiently. This avoids or minimizes stragglers and leads to smarter scheduling. Compared with the default Hadoop scheduler, the proposed solution shows a significant improvement.
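The two stages can be illustrated with a minimal sketch. It is not the proposed system: in place of a distributed machine learning model it uses the mean of each node's recent task durations as a stand-in predictor, and it takes the lower end of the 6–8x range as the straggler threshold. The names `predict_stragglers`, `assign_tasks`, and `SLOWDOWN_THRESHOLD` are hypothetical and chosen only for illustration.

```python
from statistics import median

# Assumption: a node is flagged when its predicted task time is at least
# 6x the median node's, the lower end of the 6-8x range quoted above.
SLOWDOWN_THRESHOLD = 6.0


def predict_stragglers(recent_task_seconds, threshold=SLOWDOWN_THRESHOLD):
    """Stage 1 (stand-in): flag nodes predicted to be stragglers.

    recent_task_seconds maps a node name to a non-empty list of recent task
    durations. The prediction is simply the mean of those durations; a real
    system would use a trained model here.
    """
    predicted = {
        node: sum(times) / len(times)
        for node, times in recent_task_seconds.items()
    }
    cluster_median = median(predicted.values())
    return {
        node
        for node, seconds in predicted.items()
        if seconds >= threshold * cluster_median
    }


def assign_tasks(tasks, nodes, stragglers):
    """Stage 2 (stand-in): round-robin tasks over nodes not flagged.

    Falls back to all nodes if every node is flagged, so that work is
    still scheduled.
    """
    healthy = [n for n in nodes if n not in stragglers] or list(nodes)
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}


history = {"n1": [10, 12], "n2": [11, 9], "n3": [70, 90], "n4": [10, 10]}
stragglers = predict_stragglers(history)
plan = assign_tasks(["t1", "t2", "t3", "t4"], list(history), stragglers)
```

In this toy history only `n3` crosses the threshold, so the four tasks are spread over `n1`, `n2`, and `n4`. The round-robin policy is a placeholder; any scheduler that consults the predicted straggler set before placement fits the same two-stage shape.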