Demystifying the dark side of the middle: a field study of middlebox failures in datacenters

Network appliances or middleboxes such as firewalls, intrusion detection and prevention systems (IDPS), load balancers, and VPNs form an integral part of datacenters and enterprise networks. Recognizing both their importance and their shortcomings, the research community has proposed software implementations, policy-aware switching, consolidation appliances, moving middlebox processing to VMs and end hosts, and even offloading it to the cloud. While such efforts can use middlebox failure characteristics to improve their reliability, management, and cost-effectiveness, little has been reported on these failures in the field. In this paper, we make one of the first attempts to perform a large-scale empirical study of middlebox failures over two years in a service provider network comprising thousands of middleboxes across tens of datacenters. We find that middlebox failures are prevalent and that they can significantly impact hosted services. Several of our findings differ in key aspects from commonly held views: (1) most failures are grey, dominated by connectivity errors and link flaps that exhibit intermittent connectivity; (2) hardware faults and overload problems are present, but they are not in the majority; (3) middleboxes experience a variety of misconfigurations, such as incorrect rules, VLAN misallocation, and mismatched keys; and (4) middlebox failover is ineffective in about 33% of the cases for load balancers and firewalls due to configuration bugs, faulty failovers, and software version mismatch. Finally, we analyze current middlebox proposals based on our study and discuss directions for future research.
