SOL: Safe On-Node Learning in Cloud Platforms

Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.
