MAGNETIC: Multi-Agent Machine Learning-Based Approach for Energy Efficient Dynamic Consolidation in Data Centers

Improving the energy efficiency of data centers while guaranteeing Quality of Service (QoS), and detecting server performance variability caused by hardware or software failures, are two of the major challenges for efficient resource management of large-scale cloud infrastructures. Previous work on dynamic Virtual Machine (VM) consolidation mostly focuses on the energy challenge, but falls short of proposing comprehensive, scalable, and low-overhead approaches that jointly tackle energy efficiency and performance variability. Moreover, such work usually assumes oversimplified power models and fails to accurately account for all the delay and power costs associated with VM migration and host power mode transition. These assumptions are no longer valid in modern servers executing heterogeneous workloads, and they lead to unrealistic or inefficient results. In this paper, we propose a centralized-distributed, low-overhead, failure-aware dynamic VM consolidation strategy to minimize energy consumption in large-scale data centers. Our approach selects the most adequate power mode and frequency of each host at runtime using a distributed multi-agent Machine Learning (ML) based strategy, and migrates the VMs accordingly using a centralized heuristic. Our Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation (MAGNETIC) is implemented in a modified version of the CloudSim simulator and considers the energy and delay overheads associated with host power mode transition and VM migration. It is evaluated using power traces collected from various workloads running on real servers and resource utilization logs from cloud data center infrastructures.
Results show that our strategy reduces data center energy consumption by up to 15% compared to other state-of-the-art (SoA) approaches, while guaranteeing the same QoS and reducing the number of VM migrations and host power mode transitions by up to 86% and 90%, respectively. Moreover, it scales better than all other approaches, incurring less than 0.7% execution time overhead for a data center with 1500 VMs. Finally, our solution is capable of detecting host performance variability due to failures, automatically migrating VMs away from failing hosts and draining those hosts of workload.
