Towards Self-Tuning Parameter Servers

In recent years, advances in many applications have been driven by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training. Good efficiency, i.e., a short completion time for a given ML job, is therefore a key feature of a successful ML system. While the completion time of a long-running ML job is determined by the time required to reach model convergence, in practice it is also largely influenced by the values of various system settings. In this paper, we contribute techniques towards building self-tuning parameter servers. The Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; by self-tuning we mean that while a long-running ML job is iteratively training the expert-suggested model, the system is also iteratively learning which system setting is more efficient for that job and applying it online. While our techniques are general enough to apply to various PS-style ML systems, we have prototyped them on top of TensorFlow. Experiments show that our techniques can reduce the completion times of a variety of long-running TensorFlow jobs by 1.4x to 18x.

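To make the notion of online self-tuning concrete, the following is a minimal, hypothetical sketch of the general idea: training proceeds in rounds, and between rounds the system occasionally tries an alternative system setting (e.g., degree of parallelism or batch size) and keeps whichever setting has yielded the fastest rounds so far. The names `CANDIDATE_CONFIGS`, `apply_config`, and `run_training_round`, as well as the epsilon-greedy search, are illustrative assumptions for this sketch only; they are not the paper's actual technique or a TensorFlow API.

```python
# Hypothetical sketch: epsilon-greedy online tuning of system settings
# while a long-running training job keeps making progress.
import random
import time

# Assumed candidate system settings; a real PS job would expose its own knobs.
CANDIDATE_CONFIGS = [
    {"num_workers": 4, "batch_size": 128},
    {"num_workers": 8, "batch_size": 128},
    {"num_workers": 8, "batch_size": 256},
    {"num_workers": 16, "batch_size": 256},
]

def apply_config(config):
    """Placeholder: reconfigure the running job (e.g., adjust parallelism)."""
    pass

def run_training_round():
    """Placeholder: run a fixed number of training iterations."""
    time.sleep(random.uniform(0.1, 0.5))  # stand-in for real training work

def online_tune(num_rounds, explore_prob=0.2):
    """Keep the fastest setting observed so far, but occasionally explore
    an alternative, so tuning happens while training continues."""
    best_time = {}  # config index -> best observed round time
    current = 0     # start from an arbitrary configuration
    for _ in range(num_rounds):
        if random.random() < explore_prob:
            current = random.randrange(len(CANDIDATE_CONFIGS))  # explore
        apply_config(CANDIDATE_CONFIGS[current])
        start = time.time()
        run_training_round()
        elapsed = time.time() - start
        best_time[current] = min(elapsed, best_time.get(current, float("inf")))
        current = min(best_time, key=best_time.get)              # exploit
    return CANDIDATE_CONFIGS[min(best_time, key=best_time.get)]
```

A more sophisticated tuner could replace the epsilon-greedy choice with a cost model or Bayesian optimization over the candidate settings; the sketch only illustrates that measurement and reconfiguration are interleaved with ordinary training rounds rather than performed in a separate offline search.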