Cerebro: A Data System for Optimized Deep Learning Model Selection

Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to wider adoption: the pain and resource intensiveness of model selection. This empirical process involves exploring deep net architectures and hyper-parameters, often requiring hundreds of trials. Alas, most ML systems focus on training one model at a time, reducing throughput and raising overall resource costs; some also sacrifice reproducibility. We present Cerebro, a new data system to raise deep net model selection throughput at scale without raising resource costs and without sacrificing reproducibility or accuracy. Cerebro uses a new parallel SGD execution strategy we call model hopper parallelism that hybridizes task- and data-parallelism to mitigate the cons of these prior paradigms and offer the best of both worlds. Experiments on large ML benchmark datasets show that Cerebro offers 3x to 10x runtime savings relative to data-parallel systems like Horovod and Parameter Server, and up to 8x memory/storage savings or up to 100x network savings relative to task-parallel systems. Cerebro also supports heterogeneous resources and fault tolerance.

PVLDB Reference Format:
Supun Nakandala, Yuhao Zhang, and Arun Kumar. Cerebro: A Data System for Optimized Deep Learning Model Selection. PVLDB, 13(11): 2159-2173, 2020.
DOI: https://doi.org/10.14778/3407790.3407816
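The abstract describes model hopper parallelism (MOP) only at a high level. The minimal Python sketch below illustrates the scheduling idea under stated assumptions: each worker holds one fixed data partition, and in every epoch each model configuration visits every partition exactly once, so only the (small) model state moves between workers rather than the data. The names `ModelState` and `train_on_partition`, and the simple round-robin schedule, are illustrative assumptions, not Cerebro's actual API or scheduler.

```python
# Minimal sketch of the model hopper parallelism (MOP) idea: data partitions
# stay put on their workers; model states "hop" across workers so that each
# model trains on every partition exactly once per epoch.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelState:
    config_id: int                               # which hyper-parameter configuration
    weights: dict = field(default_factory=dict)  # opaque checkpointed model state


def train_on_partition(model: ModelState, partition: List) -> ModelState:
    """Run one sub-epoch of SGD for `model` on a single data partition.

    Stand-in for a real training loop (e.g., TensorFlow/PyTorch); here it only
    records which partitions this model has visited.
    """
    model.weights.setdefault("visited", []).append(id(partition))
    return model


def mop_epoch(models: List[ModelState], partitions: List[List]) -> List[ModelState]:
    """One epoch of MOP with a simple round-robin schedule.

    In round r, model i trains on partition (i + r) mod p. Over p rounds each
    model therefore visits all p partitions exactly once, and no partition is
    ever shipped over the network -- only model state moves. In a real cluster
    the assignments within a round run concurrently, one model per worker
    (assuming no more models than workers per round).
    """
    p = len(partitions)
    for r in range(p):
        for i, model in enumerate(models):
            models[i] = train_on_partition(model, partitions[(i + r) % p])
    return models


if __name__ == "__main__":
    # 4 workers/partitions, 3 model configurations being selected over.
    partitions = [list(range(k * 10, (k + 1) * 10)) for k in range(4)]
    models = [ModelState(config_id=c) for c in range(3)]
    models = mop_epoch(models, partitions)
    for m in models:
        assert len(set(m.weights["visited"])) == len(partitions)
    print("Each model visited every partition exactly once this epoch.")
```

Because each model still consumes all of the data sequentially within an epoch, per-model SGD semantics (and hence reproducibility) can be preserved as in task parallelism, while the data itself remains partitioned in place as in data parallelism, which is the hybridization the abstract refers to.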
