FBOSS: building switch software at scale

The conventional software running on network devices, such as switches and routers, is typically vendor-supplied, proprietary, and closed-source; as a result, it tends to contain extraneous features that a single operator is unlikely to use fully. Furthermore, cloud-scale data center networks often have software and operational requirements that switch vendors may not address well. In this paper, we present our ongoing experience in overcoming the complexity and scaling issues that we face when designing, developing, deploying, and operating in-house software built to manage and support the set of features required for the data center switches of a large-scale Internet content provider. We present FBOSS, our own data center switch software, which is designed around our switch-as-a-server and deploy-early-and-iterate principles. We treat the software running on data center switches like any other software service that runs on a commodity server. We also build and deploy only a minimal set of features and iterate on them. These principles allow us to rapidly iterate on, test, deploy, and manage FBOSS at scale. Over the last five years, our experience has shown that FBOSS's design principles allow us to quickly build a stable and scalable network. As evidence, we have grown the number of FBOSS instances running in our data center by over 30x over a two-year period.
