Fault Diagnosis for the Virtualized Network in the Cloud Environment using Reinforcement Learning

In the cloud environment, the virtualized network provides connectivity to a massive number of virtual machines through various virtual network devices. In such a complicated networking system, network faults are common. It is urgent for system administrators to be able to investigate a fault and recover from it. However, the complexity of the virtualized network and the similarity among fault symptoms make accurate diagnosis challenging. In this paper, we leverage reinforcement learning to facilitate fault diagnosis in the cloud environment, diagnosing faults in an “exploration and exploitation” manner. Further, we investigate the key factors that influence network performance and may cause network faults. Based on this investigation, we present how to train the network diagnosis module with the Q-learning algorithm. Experimental results show that the diagnosis accuracy of our reinforcement-learning-based method is around 8% higher than that of traditional methods, while incurring only slight system overhead.
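
To make the "exploration and exploitation" idea concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy action selection. It is not the paper's implementation: the action names, the state labels, the reward values, and the hyperparameters `alpha`, `gamma`, and `epsilon` are all illustrative assumptions, since the abstract does not specify them.

```python
import random
from collections import defaultdict

# Hypothetical diagnostic actions; the paper's actual action set is not given here.
ACTIONS = ["check_vswitch", "check_vnic", "check_vrouter"]


def make_q_table():
    """Return Q[state][action], defaulting to 0.0 for unseen states."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[state][a])


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning update: Q <- Q + alpha * (r + gamma * max Q' - Q)."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]
```

In this sketch, a state could be an observed symptom (for example, a hypothetical `"high_latency"` label) and the reward could be positive when the chosen check locates the fault. A high `epsilon` favors exploring untried checks, while a low `epsilon` exploits the checks that have paid off so far.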
