An Approach for Modeling and Ranking Node-Level Stragglers in Cloud Datacenters

The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.

[1]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[2]  Jie Xu,et al.  Analysis, Modeling and Simulation of Workload Patterns in a Large-Scale Utility Cloud , 2014, IEEE Transactions on Cloud Computing.

[3]  Chao Li,et al.  Fuxi: a Fault-Tolerant Resource Management and Job Scheduling System at Internet Scale , 2014, Proc. VLDB Endow..

[4]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[5]  Umesh Kumar,et al.  A Comprehensive Review of Straggler Handling Algorithms for MapReduce Framework , 2014 .

[6]  F. Massey The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .

[7]  Albert G. Greenberg,et al.  Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.

[8]  Jie Xu,et al.  Timely Long Tail Identification through Agent Based Monitoring and Analytics , 2015, 2015 IEEE 18th International Symposium on Real-Time Distributed Computing.

[9]  Y. B. Wah,et al.  Power comparisons of Shapiro-Wilk , Kolmogorov-Smirnov , Lilliefors and Anderson-Darling tests , 2011 .

[10]  Pankesh Patel,et al.  Service Level Agreement in Cloud Computing , 2009 .

[11]  Neeraja J. Yadwadkar Proactive Straggler Avoidance using Machine Learning , 2012 .

[12]  Eshcar Hillel,et al.  Predicting Execution Bottlenecks in Map-Reduce Clusters , 2012, HotCloud.

[13]  Jie Xu,et al.  Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation , 2016, 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA).

[14]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[15]  Magdalena Balazinska,et al.  SkewTune: mitigating skew in mapreduce applications , 2012, SIGMOD Conference.

[16]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[17]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[18]  Lakshmi Sobhana Kalli,et al.  Market-Oriented Cloud Computing : Vision , Hype , and Reality for Delivering IT Services as Computing , 2013 .

[19]  Jialin Li,et al.  Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency , 2014, SoCC.

[20]  Zhen Xiao,et al.  Improving MapReduce Performance Using Smart Speculative Execution Strategy , 2014, IEEE Transactions on Computers.

[21]  Luiz André Barroso,et al.  The tail at scale , 2013, CACM.