RILNET: A Reinforcement Learning Based Load Balancing Approach for Datacenter Networks

Modern datacenter networks face various challenges, e.g., highly dynamic workloads, congestion, and topology asymmetry. ECMP, the traditional load balancing mechanism widely used in today's datacenters, can balance load poorly and lead to congestion. A variety of load balancing schemes have been proposed to address the problems of ECMP. However, these traditional schemes usually make load balancing decisions based only on network knowledge from a single snapshot or a short period in the past. In this paper, we propose a Reinforcement Learning (RL) based approach, called RILNET (ReInforcement Learning NETworking), for load balancing in datacenter networks. RILNET employs RL to learn about a network and to control it based on the learned experience. To achieve a finer granularity of control, RILNET is constructed to route flowlets rather than flows. Moreover, RILNET makes routing decisions for aggregation flows instead of single flows, where an aggregation flow is the set of all flows going from the same source edge switch to the same destination edge switch. To evaluate RILNET, we conduct a flow-level simulation and a packet-level simulation, and both sets of results show that RILNET balances traffic load much more effectively than ECMP and another load balancing solution, DRILL. RILNET outperforms DRILL in data loss and maximal link delay: the maximal link data loss and the maximal link delay of RILNET are 44.4% and 25.4% smaller than those of DRILL, respectively.
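
The two granularity notions in the abstract can be illustrated with a minimal sketch. It is not RILNET's implementation: the abstract does not say how flowlets are detected, so the sketch assumes the common idle-gap definition (a new flowlet starts when a flow has been idle longer than a threshold). The names `aggregate_flows`, `split_flowlets`, `FLOWLET_GAP`, and the flow record fields are hypothetical.

```python
from collections import defaultdict

# Hypothetical idle-gap threshold in seconds; not a value from the paper.
FLOWLET_GAP = 0.0005


def aggregate_flows(flows):
    """Group flow ids by (source edge switch, destination edge switch).

    Each group corresponds to one aggregation flow as defined in the abstract.
    """
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow["src_edge"], flow["dst_edge"])].append(flow["id"])
    return dict(groups)


def split_flowlets(timestamps, gap=FLOWLET_GAP):
    """Split one flow's packet arrival times into flowlets.

    Assumption: a new flowlet begins whenever the idle time since the
    previous packet exceeds `gap`.
    """
    flowlets = []
    for t in sorted(timestamps):
        if flowlets and t - flowlets[-1][-1] <= gap:
            flowlets[-1].append(t)   # still inside the current burst
        else:
            flowlets.append([t])     # idle gap exceeded: start a new flowlet
    return flowlets


if __name__ == "__main__":
    flows = [
        {"id": "f1", "src_edge": "e1", "dst_edge": "e3"},
        {"id": "f2", "src_edge": "e1", "dst_edge": "e3"},
        {"id": "f3", "src_edge": "e2", "dst_edge": "e3"},
    ]
    print(aggregate_flows(flows))
    print(split_flowlets([0.0, 0.0001, 0.002, 0.0021]))
```

In this sketch, flows `f1` and `f2` share the same edge-switch pair and therefore fall into one aggregation flow, while the four packet timestamps split into two flowlets because of the long idle gap between the second and third packets.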
