High performance multipath routing for datacenters

Performance-optimized datacenter networks aim to handle the growing East-West, intra-cluster traffic of Big Data applications more efficiently. The demanding latency constraints and traffic patterns of these applications expose the inherent bottlenecks of the often oversubscribed datacenter network topologies, favoring instead fat-trees with full bisection bandwidth. Yet the topological benefits of fat-trees may remain unrealized in practical deployments if such fabrics use single-path routing or flow-level multipath routing (ECMP hashing). Here we model in detail, at Layer 2, the routing performance of modern fat-tree networks using stochastic permutations of bursty traffic. We first simplify the analysis and then validate with accurate simulation models that the throughputs for "static" d-mod-k routing and for ECMP-like multipath routing are 63% and 47%, respectively. We also find that ECMP routing results in a wide spread of link loads under random permutation traffic, which manifests as a 3x throughput reduction for 30% of the flows. Furthermore, ECMP can lead to collisions between mouse and elephant flows, often increasing the flow completion time (FCT) of delay-sensitive flows by a factor of 10. In contrast, packet-based multipath routing outperforms all the other schemes in this study.
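
The difference between the three routing schemes can be illustrated with a minimal Python sketch. It is not the model used in the paper and is not expected to reproduce the 63% and 47% figures. It assumes a two-level leaf-spine fabric with as many spines as hosts per leaf, counts only leaf uplink loads, and replaces the ECMP header hash with a uniform random choice. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def d_mod_k(src, dst, n_spines, rng):
    # Static routing: the uplink depends only on the destination address.
    return {dst % n_spines: 1.0}

def ecmp_like(src, dst, n_spines, rng):
    # Flow-level multipath: one uplink per flow. A uniform random pick
    # stands in for a hash over the flow's header fields (assumption).
    return {rng.randrange(n_spines): 1.0}

def packet_spray(src, dst, n_spines, rng):
    # Packet-level multipath: the flow is split evenly over all uplinks.
    return {spine: 1.0 / n_spines for spine in range(n_spines)}

def uplink_loads(perm, hosts_per_leaf, n_spines, route, rng):
    """Offered load on every (leaf, spine) uplink for the permutation
    src -> perm[src], with each flow offering one unit of load."""
    loads = defaultdict(float)
    for src, dst in enumerate(perm):
        src_leaf = src // hosts_per_leaf
        dst_leaf = dst // hosts_per_leaf
        if src_leaf == dst_leaf:
            continue  # traffic that stays under one leaf uses no uplink
        for spine, share in route(src, dst, n_spines, rng).items():
            loads[(src_leaf, spine)] += share
    return loads

if __name__ == "__main__":
    rng = random.Random(0)
    hosts_per_leaf = n_spines = 8
    n_hosts = hosts_per_leaf * 16
    perm = list(range(n_hosts))
    rng.shuffle(perm)  # one random permutation of destinations
    for route in (d_mod_k, ecmp_like, packet_spray):
        loads = uplink_loads(perm, hosts_per_leaf, n_spines, route, rng)
        print(route.__name__, "max uplink load:", max(loads.values()))
```

An uplink load above 1.0 means that several flows share one link and each gets only a fraction of its capacity. Under these assumptions, packet spraying never exceeds 1.0 on an uplink, while the two per-flow schemes can place several flows on the same uplink.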
