A Comprehensive Study of Bugs in Software Defined Networks

Software-defined networking (SDN) enables innovative solutions in the networking domain by decoupling the control plane from the data plane. In an SDN environment, the network control logic for load balancing, routing, and access control is written in software running on a decoupled control plane. Like any software, the SDN control plane is prone to bugs that impact the network’s performance and availability. Yet, as a community, we lack holistic, in-depth studies of bugs within the SDN ecosystem. A bug taxonomy is one of the most promising ways to lay the foundations required for (1) evaluating and directing emerging research on fault detection and recovery, and (2) informing the operational practices of network administrators. This paper takes the first step towards laying this foundation by providing a comprehensive study and analysis of over 500 ‘critical’ bugs (including $\sim 150$ analyzed manually) in three of the most widely used SDN controllers: FAUCET, ONOS, and CORD. We create a taxonomy of these SDN bugs and analyze their operational impact and their implications for developers. We then use the taxonomy to analyze the effectiveness and coverage of several prominent SDN fault tolerance and diagnosis techniques. To the best of our knowledge, this study is the first of its kind in scale and coverage.
