SpotTune: Leveraging Transient Resources for Cost-efficient Hyper-parameter Tuning in the Public Cloud

Hyper-parameter tuning (HPT) is crucial for many machine learning (ML) algorithms, but because the search space is large, HPT is usually time-consuming and resource-intensive. Many researchers now train ML models on public cloud resources, which is convenient yet expensive. Speeding up HPT while also reducing its cost is therefore important for cloud ML users. In this paper, we propose SpotTune, an approach that exploits transient, revocable resources in the public cloud, together with tailored strategies, to perform HPT in a parallel and cost-efficient manner. Orchestrating the HPT process on transient servers, SpotTune uses two main techniques, fine-grained cost-aware resource provisioning and ML training trend prediction, to reduce the monetary cost and runtime of HPT. Our evaluations show that SpotTune can reduce cost by up to 90% and achieve a 16.61x improvement in performance-cost rate.
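The two techniques named above can be illustrated with a minimal Python sketch. This is not SpotTune's implementation: the abstract does not specify the provisioning policy or the trend predictor, so `SpotOffer`, `cheapest_offer`, and `extrapolate_loss` are hypothetical names, the prices and throughputs are made-up inputs, and the linear extrapolation is a deliberately crude stand-in for whatever predictor the system actually uses.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SpotOffer:
    """A transient instance offer (all fields are illustrative assumptions)."""
    name: str
    price_per_hour: float   # current spot price in USD
    steps_per_hour: float   # measured training throughput on this instance


def cost_per_step(offer: SpotOffer) -> float:
    """Monetary cost of one training step on this offer."""
    return offer.price_per_hour / offer.steps_per_hour


def cheapest_offer(offers: Sequence[SpotOffer]) -> SpotOffer:
    """Cost-aware provisioning (assumed policy): minimize cost per step."""
    return min(offers, key=cost_per_step)


def extrapolate_loss(losses: List[float], total_steps: int) -> float:
    """Trend prediction (assumed): extend the last observed slope linearly
    to total_steps, with the result floored at zero."""
    if len(losses) < 2:
        raise ValueError("need at least two observed losses")
    slope = losses[-1] - losses[-2]
    remaining = total_steps - len(losses)
    return max(0.0, losses[-1] + slope * remaining)


def keep_promising(curves: dict, total_steps: int, keep: int) -> List[str]:
    """Return the `keep` configurations with the lowest predicted final loss."""
    ranked = sorted(curves, key=lambda c: extrapolate_loss(curves[c], total_steps))
    return ranked[:keep]
```

For example, given partial loss curves for several hyper-parameter configurations, `keep_promising` would select the ones worth continuing, and `cheapest_offer` would choose where to run them. A real system would also have to account for revocation risk and checkpointing overhead, which this sketch ignores.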
