Tagger: Practical PFC Deadlock Prevention in Data Center Networks

Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) deployments are vulnerable to deadlocks induced by Priority Flow Control (PFC). Prior solutions for deadlock prevention either require significant changes to routing protocols or require excessive buffers in the switches. In this paper, we propose Tagger, a scheme for deadlock prevention that requires no changes to the routing protocol and needs only modest buffers. Tagger is based on the insight that, given a set of expected lossless routes, a simple tagging scheme can be developed to ensure that no deadlock will occur under any failure conditions. Packets that do not travel on these lossless routes may be dropped under extreme conditions. We design such a scheme, prove that it prevents deadlock, and implement it efficiently on commodity hardware.
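
To make the tagging idea concrete, the following is a minimal illustrative sketch, not the scheme as specified in the paper. It assumes a three-tier Clos topology in which the expected lossless routes are plain up-down paths, and it assumes a rule that bumps a packet's tag each time the packet turns from travelling down to travelling up (a "bounce", for example after a link failure). The function name `tag_along_path`, the tier encoding, the `LOSSY` marker, and the `max_lossless_tag` parameter are all hypothetical names introduced here for illustration.

```python
from typing import List

LOSSY = 0  # hypothetical marker: the packet has left the lossless queues


def tag_along_path(layers: List[int], max_lossless_tag: int = 2) -> List[int]:
    """Return the tag a packet carries at each switch along a path.

    `layers` lists the tier of every switch on the path
    (assumed encoding: 0 = ToR, 1 = leaf, 2 = spine).
    The tag starts at 1 and never decreases while the packet is lossless.
    """
    tag = 1
    going_down = False
    tags = [tag]
    for prev, cur in zip(layers, layers[1:]):
        if cur < prev:
            # The packet has started (or continues) its downward leg.
            going_down = True
        elif cur > prev and going_down:
            # Down-to-up turn: the packet left the expected up-down route.
            going_down = False
            if tag != LOSSY:
                # Move to the next lossless tag, or fall back to lossy
                # once the assumed budget of lossless tags is used up.
                tag = tag + 1 if tag < max_lossless_tag else LOSSY
        tags.append(tag)
    return tags


if __name__ == "__main__":
    print(tag_along_path([0, 1, 2, 1, 0]))        # expected route, no bounce
    print(tag_along_path([0, 1, 0, 1, 2, 1, 0]))  # one bounce
    print(tag_along_path([0, 1, 0, 1, 0, 1]))     # two bounces
```

Under these assumptions, a packet with a given tag only ever follows an up-down segment, and tags only grow, so buffers reserved per tag cannot wait on each other in a cycle. A packet that bounces more often than the tag budget allows is treated as lossy, which matches the abstract's statement that packets off the lossless routes may be dropped under extreme conditions.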
