DERP: A Deep Reinforcement Learning Cloud System for Elastic Resource Provisioning

Modern large scale computer clusters benefit significantly from elasticity. Elasticity allows a cluster to dynamically allocate computer resources, based on the user's fluctuating workload demands. Many cloud providers use threshold-based approaches, which have been proven to be difficult to configure and optimise, while others use reinforcement learning and decision-tree approaches, which struggle when having to handle large multidimensional cluster states. In this work we use Deep Reinforcement learning techniques to achieve automatic elasticity. We use three different approaches of a Deep Reinforcement learning agent, called DERP (Deep Elastic Resource Provisioning), that takes as input the current multi-dimensional state of a cluster and manages to train and converge to the optimal elasticity behaviour after a finite amount of training steps. The system automatically decides and proceeds on requesting/releasing VM resources from the provider and orchestrating them inside a NoSQL cluster according to user-defined policies/rewards. We compare our agent to state-of-the-art, Reinforcement learning and decision-tree based, approaches in demanding simulation environments and show that it gains rewards up to 1.6 times better on its lifetime. We then test our approach in a real life cluster environment and show that the system resizes clusters in real-time and adapts its performance through a variety of demanding optimisation strategies, input and training loads.

[1]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[4]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[5]  P. Mell,et al.  The NIST Definition of Cloud Computing , 2011 .

[6]  Ioannis Konstantinou,et al.  Automated, Elastic Resource Provisioning for NoSQL Clusters Using TIRAMOLA , 2013, 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing.

[7]  Michael Stonebraker,et al.  P-Store: An Elastic Database System with Predictive Provisioning , 2018, SIGMOD Conference.

[8]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[9]  Le Yi Wang,et al.  VCONF: a reinforcement learning approach to virtual machines auto-configuration , 2009, ICAC '09.

[10]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[11]  Xiaohui Gu,et al.  AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service , 2013, ICAC.

[12]  David Silver,et al.  Deep Reinforcement Learning with Double Q-Learning , 2015, AAAI.

[13]  Ioannis Konstantinou,et al.  CELAR: Automated application elasticity platform , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[14]  Gang Wang,et al.  Convolutional recurrent neural networks: Learning spatial dependencies for image representation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[15]  Martin A. Riedmiller,et al.  A direct adaptive method for faster backpropagation learning: the RPROP algorithm , 1993, IEEE International Conference on Neural Networks.

[16]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..

[17]  Ioannis Konstantinou,et al.  Elastic management of cloud applications using adaptive reinforcement learning , 2017, 2017 IEEE International Conference on Big Data (Big Data).