Chaos Monkey: Increasing SDN Reliability through Systematic Network Destruction

As modern networking applications become increasingly dynamic and high-bandwidth, software-defined networking (SDN) has emerged as an agile, cost-effective architecture with widespread adoption across industry. In SDN, the control-plane program runs on a logically centralized controller, which directly configures the packet-handling mechanisms in the underlying switches using an open API (e.g., OpenFlow). While the controller makes it exceptionally convenient for a network operator to control and manage a network, it requires complex logic and becomes a single point of failure within the network. As a result, configuration errors by the controller could be extremely costly for the network provider. Several SDN controllers have been developed since the conception of SDN, and network operators have relied on traditional means of identifying bugs in the controller, such as unit testing and model checking [1]. However, it has become apparent that these methods cannot practically handle the inherent complexity of a controller platform that manages large networks. One major source of this complexity is network failures, as they trigger execution of unexplored portions of code. These failures are inevitable and costly, and considering all possible interleavings of failures is simply infeasible. To address this problem, we propose "Chaos Monkey," a real-time, post-deployment failure injection tool. Inspired by industry practices in the cloud [2], Chaos Monkey is intended to systematically introduce failures (e.g., link failure, network failure) into a network. Chaos Monkey is guided by the following design principles: