A3: An Automatic Topology-Aware Malfunction Detection and Fixation System in Data Center Networks

Link failures and cable miswirings are not uncommon when building data center networks, and they prevent existing automatic address configuration methods from functioning correctly. However, accurately detecting such malfunctions is not easy, because they may cause no observable change in node degree. Fixing such malfunctions is even harder, as almost no existing work provides accurate fixation suggestions. To solve these problems, we design and implement A3, an automatic topology-aware malfunction detection and fixation system. A3 formulates the problem of finding the minimal fixation as the problem of computing the minimum graph difference (NP-hard). For FatTree, it solves this problem in O(k^6) time for fewer than k/2 undirected link malfunctions and in O(k^3) time for fewer than k/4. Our evaluation demonstrates that for fewer than k/2 undirected link malfunctions, A3 is 100% accurate in malfunction detection and provides the minimum fixation result. For k/2 or more undirected link malfunctions, A3 still achieves close to 100% accuracy and provides a near-optimal fixation result.
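To illustrate why a miswiring may leave node degrees unchanged, here is a minimal Python sketch. It is not A3's algorithm. It assumes that device labels are already known, which sidesteps the NP-hard matching step, and it models only the switch-level links of a k-ary FatTree. The helper names (`fattree_links`, `degrees`, `link_difference`) and the node labeling are hypothetical and chosen only for this example.

```python
from collections import Counter

def fattree_links(k):
    """Switch-level links of a k-ary FatTree (hosts omitted, labels assumed)."""
    half = k // 2
    links = set()
    for pod in range(k):
        for a in range(half):
            agg = ("agg", pod, a)
            # Each aggregation switch connects to every edge switch in its pod.
            for e in range(half):
                links.add(frozenset((agg, ("edge", pod, e))))
            # Aggregation switch a connects to core group a.
            for c in range(half):
                links.add(frozenset((agg, ("core", a * half + c))))
    return links

def degrees(links):
    """Count how many links touch each node."""
    deg = Counter()
    for link in links:
        for node in link:
            deg[node] += 1
    return deg

def link_difference(blueprint, physical):
    """Links to add and links to remove, given a known labeling (an assumption)."""
    missing = blueprint - physical
    unexpected = physical - blueprint
    return missing, unexpected

blueprint = fattree_links(4)

# Simulated miswiring: two cables have their core-side ends swapped.
a = frozenset((("agg", 0, 0), ("core", 0)))
b = frozenset((("agg", 1, 1), ("core", 2)))
a_wrong = frozenset((("agg", 0, 0), ("core", 2)))
b_wrong = frozenset((("agg", 1, 1), ("core", 0)))
physical = (blueprint - {a, b}) | {a_wrong, b_wrong}

missing, unexpected = link_difference(blueprint, physical)
same_degrees = degrees(blueprint) == degrees(physical)
```

In this sketch every node keeps its degree after the swap, so a degree check alone would not reveal the fault, while the labeled link difference lists the two missing and two unexpected links. Without known labels, finding the smallest such difference is the minimum graph difference problem described above.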
