Randomized Local Fast Rerouting for Datacenter Networks with Almost Optimal Congestion

To ensure high availability, datacenter networks must rely on local fast rerouting mechanisms that allow routers to react quickly to link failures in a fully decentralized manner. However, configuring these mechanisms to provide high resilience against multiple failures while avoiding congestion along failover routes is algorithmically challenging, as the rerouting rules can only depend on local failure information and must be defined ahead of time. This paper presents a randomized local fast rerouting algorithm for Clos networks, the predominant datacenter topologies. Given a graph G = (V, E) describing a Clos topology, our algorithm defines local routing rules for each node v ∈ V, which depend only on the packet’s destination and are conditioned on the incident link failures. We prove that as long as the number of failures at each node does not exceed a certain bound, our algorithm achieves asymptotically minimal congestion, up to polyloglog factors, along failover paths. Our lower bounds are developed under some natural routing assumptions.

2012 ACM Subject Classification: Theory of computation → Approximation algorithms analysis; Theory of computation → Distributed algorithms; Networks → Data path algorithms
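As a rough illustration of what a local, destination-based failover rule conditioned on incident link failures could look like, consider the following minimal Python sketch. It is not the algorithm of this paper: the function `next_hop`, the `default_hop` table, and the uniform random choice among surviving uplinks are all hypothetical and chosen only to make the setting concrete.

```python
import random
from typing import Dict, List, Optional, Set


def next_hop(
    destination: str,
    default_hop: Dict[str, str],
    uplinks: List[str],
    failed: Set[str],
    rng: random.Random,
) -> Optional[str]:
    """Pick a next hop using only local information (hypothetical rule).

    The decision depends on the packet's destination and on which of the
    node's own incident links have failed, and on nothing else.
    """
    preferred = default_hop[destination]
    if preferred not in failed:
        # No relevant local failure: follow the precomputed default route.
        return preferred
    # Default link is down: spread traffic uniformly at random over the
    # surviving incident links (assumed strategy, for illustration only).
    alive = [u for u in uplinks if u not in failed]
    return rng.choice(alive) if alive else None


rng = random.Random(0)
default = {"d": "s1"}
links = ["s1", "s2", "s3"]
print(next_hop("d", default, links, set(), rng))
print(next_hop("d", default, links, {"s1"}, rng))
```

Randomizing over the surviving links is one simple way to avoid deterministically concentrating rerouted traffic on a single failover link; the congestion guarantees stated in the abstract apply to the paper's own algorithm, not to this sketch.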
