Paper information - Masking failures from application performance in data center networks with shareable backup

Shareable backup is an economical and effective way to mask failures from application performance. A small number of backup switches are shared network-wide and used to repair failures on demand, so that the network quickly recovers its full capacity without applications noticing the failures. This approach avoids the complications and ineffectiveness of rerouting. We propose ShareBackup as a prototype architecture that realizes this concept, and we present its detailed design. We implement ShareBackup on a hardware testbed. Its failure recovery takes merely 0.73 ms and causes no disruption to routing, and it accelerates Spark and Tez jobs by up to 4.1x under failures. Large-scale simulations with real data center traffic and a real failure model show that ShareBackup reduces the percentage of job flows prolonged by failures from 47.2% to as little as 0.78%. In all our experiments, the results for ShareBackup differ little from the no-failure case.
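The core idea in the abstract, a small pool of spare switches shared by many production switches and assigned on demand, can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ShareBackup design: the class `SharedBackupPool`, its methods, and the switch names are all hypothetical, and the model ignores how a spare is physically wired in place of a failed switch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SharedBackupPool:
    """Toy model (hypothetical): a few spares shared by many switches."""

    spares: List[str]
    # Maps each failed production switch to the spare standing in for it.
    active: Dict[str, str] = field(default_factory=dict)

    def fail(self, switch: str) -> Optional[str]:
        """Assign a spare to a failed switch; return None if none is free."""
        if switch in self.active:
            return self.active[switch]  # already covered by a spare
        if not self.spares:
            return None  # pool exhausted: capacity stays lost until a repair
        spare = self.spares.pop()
        self.active[switch] = spare
        return spare

    def repair(self, switch: str) -> None:
        """Return the spare to the pool once the failed switch is fixed."""
        spare = self.active.pop(switch, None)
        if spare is not None:
            self.spares.append(spare)


# Two spares cover the whole (hypothetical) network.
pool = SharedBackupPool(spares=["backup-0", "backup-1"])
first = pool.fail("edge-3")
second = pool.fail("agg-7")
third = pool.fail("core-1")   # no spare is left at this point
pool.repair("edge-3")
fourth = pool.fail("core-1")  # reuses the spare that was just returned
```

The sketch shows why sharing is economical: the pool only needs to be as large as the number of simultaneous failures, not the number of switches. It also shows the limit of the approach in this simplified model, since a third concurrent failure goes unmasked until a spare is returned.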
