Understanding network failures in data centers: measurement, analysis, and implications

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software related faults, (4) failures have potential to cause loss of many small packets such as keep alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

[1]  Nick McKeown,et al.  OpenFlow: enabling innovation in campus networks , 2008, CCRV.

[2]  Albert G. Greenberg,et al.  A case study of OSPF behavior in a large enterprise network , 2002, IMW '02.

[3]  VahdatAmin,et al.  A scalable, commodity data center network architecture , 2008 .

[4]  David Watson,et al.  Experiences with monitoring OSPF on a regional service provider network , 2003, 23rd International Conference on Distributed Computing Systems, 2003. Proceedings..

[5]  Jennifer Rexford,et al.  Floodless in seattle: a scalable ethernet architecture for large enterprises , 2008, SIGCOMM '08.

[6]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[7]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[8]  Albert G. Greenberg,et al.  Data center TCP (DCTCP) , 2010, SIGCOMM '10.

[9]  Amin Vahdat,et al.  PortLand: a scalable fault-tolerant layer 2 data center network fabric , 2009, SIGCOMM '09.

[10]  Lei Shi,et al.  Dcell: a scalable and fault-tolerant network structure for data centers , 2008, SIGCOMM '08.

[11]  Sriram Ramabhadran,et al.  A study of end-to-end web access failures , 2006, CoNEXT '06.

[12]  Farnam Jahanian,et al.  Experimental Study of Internet Stabil-ity and Wide-Area Backbone Failures , 1998 .

[13]  David A. Maltz,et al.  Network traffic characteristics of data centers in the wild , 2010, IMC '10.

[14]  Anees Shaikh,et al.  A First Look at Problems in the Cloud , 2010, HotCloud.

[15]  Van-Anh Truong,et al.  Availability in Globally Distributed Storage Systems , 2010, OSDI.

[16]  Stefan Savage,et al.  California fault lines: understanding the causes and impact of network failures , 2010, SIGCOMM '10.

[17]  Amin Vahdat,et al.  A scalable, commodity data center network architecture , 2008, SIGCOMM '08.

[18]  Antony I. T. Rowstron,et al.  Symbiotic routing in future data centers , 2010, SIGCOMM '10.

[19]  DiotChristophe,et al.  Characterization of failures in an operational IP backbone network , 2008 .

[20]  Kashi Venkatesh Vishwanath,et al.  Characterizing cloud computing hardware reliability , 2010, SoCC '10.

[21]  Albert G. Greenberg,et al.  VL2: a scalable and flexible data center network , 2009, SIGCOMM '09.

[22]  Ion Stoica,et al.  A policy-aware switching layer for data centers , 2008, SIGCOMM '08.

[23]  Chen-Nee Chuah,et al.  Characterization of Failures in an Operational IP Backbone Network , 2008, IEEE/ACM Transactions on Networking.

[24]  Haitao Wu,et al.  BCube: a high performance, server-centric network architecture for modular data centers , 2009, SIGCOMM '09.

[25]  Xu Chen,et al.  Declarative configuration management for complex and dynamic networks , 2010, CoNEXT.

[26]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.