Decoupling of Distributed Consensus, Failure Detection and Agreement in SDN Control Plane

Centralized Software Defined Networking (SDN) controllers and Network Management Systems (NMS) make the controller a single point of failure (SPOF). This motivated the introduction of distributed controllers, in which controller instances are replicated and grouped into clusters to provide high availability. Replicating the controller state relies on distributed consensus and state synchronization for correct operation. Recent works have, however, demonstrated issues with this approach: false positives in the failure detectors deployed in replicas may result in oscillating leadership and control plane unavailability. In this paper, we first elaborate the problematic scenario. We then resolve the related issues by decoupling the failure detector from the underlying signaling methodology and by introducing event agreement as a necessary component of the proposed design. The effectiveness of the proposed model is validated using an exemplary implementation, demonstrated in the problematic scenario. We present an analytic model that describes the worst-case delay required to reliably agree on replica failures. The analytic formulation is confirmed empirically using varied cluster configurations in an emulated environment. Finally, we discuss the impact of each component of our design on the replica failure- and recovery-detection delay, as well as on the imposed communication overhead.
