Topology-aware network fault influence domain analysis

Highlights:
- Failure-awareness is incorporated into the system management stack.
- The concept of a network fault influence domain is proposed.
- Rules for topology-based fault influence analysis are established.

The extremely high performance of supercomputers derives from the coordination of a large number of compute nodes, so the communication subsystem significantly affects overall system performance. The breakdown of a single router or link in the interconnection network may affect a whole group of tasks, and the rapid growth of system scale makes this problem even worse. However, the impact of a network fault is typically highly skewed across different parts of the system: when a fault occurs, there is often a subset of compute nodes on which its influence is negligible. With this intuition, we designed FIDA, a network fault influence domain analysis tool that infers which part of the system suffers most severely from a fault. The influence domain reported by FIDA is then delivered to the resource management subsystem as a guideline for preferentially allocating healthy nodes, thereby achieving better performance.
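To make the idea concrete, the sketch below illustrates one plausible form of topology-based influence domain analysis: a breadth-first search over the interconnect graph that marks every node within a fixed hop radius of a failed router as "affected", leaving the rest as candidates for preferential allocation. The function name `influence_domain`, the adjacency-dict graph model, and the `max_hops` radius are illustrative assumptions for this sketch; FIDA's actual analysis rules are not specified in the abstract.

```python
# A minimal sketch of topology-based fault influence domain analysis,
# in the spirit of FIDA. The hop-radius rule used here is an assumption,
# not the paper's actual analysis rule.
from collections import deque

def influence_domain(adjacency, faulty_router, max_hops=2):
    """Return the set of routers within `max_hops` of a faulty router.

    adjacency     -- dict mapping each router id to a list of neighbor ids
    faulty_router -- id of the failed router
    max_hops      -- assumed influence radius (hypothetical parameter)
    """
    domain = {faulty_router}
    frontier = deque([(faulty_router, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # influence assumed to fade beyond this radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in domain:
                domain.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return domain

# Example: a tiny 1-D torus (ring) of 8 routers; router 3 fails.
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
affected = influence_domain(ring, faulty_router=3, max_hops=1)
healthy = set(ring) - affected
print(sorted(affected))  # routers the scheduler should deprioritize
print(sorted(healthy))   # routers it may allocate preferentially
```

In this toy run the affected set is {2, 3, 4}, and a failure-aware scheduler would steer new jobs toward the remaining routers first, mirroring the abstract's description of handing the influence domain to the resource management subsystem.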
