A Scalable and Cloud-Native Hyperparameter Tuning System

In this paper, we introduce Katib: a scalable, cloud-native, and production-ready hyperparameter tuning system that is agnostic to the underlying machine learning framework. Although multiple hyperparameter tuning systems are available, Katib is the first to cater to the needs of both users and administrators of the system. We present the motivation and design of the system and contrast it with existing hyperparameter tuning systems, particularly in terms of multi-tenancy, scalability, fault tolerance, and extensibility. Katib can be deployed on a local machine, hosted as a service in an on-premise data center, or run in a private or public cloud. We demonstrate the advantages of the system through experimental results as well as real-world production use cases. Katib has active contributors from multiple companies and is open-sourced at \emph{this https URL} under the Apache 2.0 license.
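
To make the cloud-native, framework-agnostic design concrete: in Katib, a tuning job is declared as a Kubernetes custom resource. The sketch below follows Katib's public v1beta1 Experiment CRD; it is an illustration rather than an excerpt from the paper, and the container image, metric name, and parameter ranges are assumptions chosen for this example.

```yaml
# Minimal sketch of a Katib Experiment (v1beta1 CRD). Values are
# illustrative assumptions, not taken from the paper.
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
  namespace: kubeflow
spec:
  objective:
    type: maximize                  # maximize the reported metric
    goal: 0.99                      # stop once accuracy reaches 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random           # pluggable: grid, random, bayesianoptimization, hyperband, ...
  parallelTrialCount: 3             # trials run concurrently as Kubernetes workloads
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        reference: lr               # binds the search parameter to the template below
      - name: numberLayers
        reference: num-layers
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                # hypothetical training image; any framework's container works here
                image: docker.io/example/train:latest
                command:
                  - python3
                  - train.py
                  - --lr=${trialParameters.learningRate}
                  - --num-layers=${trialParameters.numberLayers}
            restartPolicy: Never
```

Because each trial is just a templated Kubernetes workload, the same Experiment mechanism can drive TensorFlow, PyTorch, XGBoost, or any other training container, which is the sense in which the system is framework-agnostic.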
