Cerebro: A Data System for Optimized Deep Learning Model Selection

Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to wider adoption: the pain and resource intensiveness of model selection. This empirical process involves exploring deep net architectures and hyper-parameters, often requiring hundreds of trials. Alas, most ML systems focus on training one model at a time, reducing throughput and raising overall resource costs; some also sacrifice reproducibility. We present Cerebro, a new data system to raise deep net model selection throughput at scale without raising resource costs and without sacrificing reproducibility or accuracy. Cerebro uses a new parallel SGD execution strategy we call model hopper parallelism that hybridizes task- and data-parallelism to mitigate the cons of these prior paradigms and offer the best of both worlds. Experiments on large ML benchmark datasets show that Cerebro offers 3x to 10x runtime savings relative to data-parallel systems like Horovod and Parameter Server, and up to 8x memory/storage savings or up to 100x network savings relative to task-parallel systems. Cerebro also supports heterogeneous resources and fault tolerance.

PVLDB Reference Format:
Supun Nakandala, Yuhao Zhang, and Arun Kumar. Cerebro: A Data System for Optimized Deep Learning Model Selection. PVLDB, 13(11): 2159-2173, 2020.
DOI: https://doi.org/10.14778/3407790.3407816
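The abstract describes model hopper parallelism (MOP) only at a high level. The minimal Python sketch below illustrates the scheduling idea under stated assumptions: each worker holds one fixed data partition, and in every epoch each model configuration visits every partition exactly once, so only the (small) model state moves between workers rather than the data. The names `ModelState` and `train_on_partition`, and the simple round-robin schedule, are illustrative assumptions, not Cerebro's actual API or scheduler.

```python
# Minimal sketch of the model hopper parallelism (MOP) idea: data partitions
# stay put on their workers; model states "hop" across workers so that each
# model trains on every partition exactly once per epoch.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelState:
    config_id: int                               # which hyper-parameter configuration
    weights: dict = field(default_factory=dict)  # opaque checkpointed model state


def train_on_partition(model: ModelState, partition: List) -> ModelState:
    """Run one sub-epoch of SGD for `model` on a single data partition.

    Stand-in for a real training loop (e.g., TensorFlow/PyTorch); here it only
    records which partitions this model has visited.
    """
    model.weights.setdefault("visited", []).append(id(partition))
    return model


def mop_epoch(models: List[ModelState], partitions: List[List]) -> List[ModelState]:
    """One epoch of MOP with a simple round-robin schedule.

    In round r, model i trains on partition (i + r) mod p. Over p rounds each
    model therefore visits all p partitions exactly once, and no partition is
    ever shipped over the network -- only model state moves. In a real cluster
    the assignments within a round run concurrently, one model per worker
    (assuming no more models than workers per round).
    """
    p = len(partitions)
    for r in range(p):
        for i, model in enumerate(models):
            models[i] = train_on_partition(model, partitions[(i + r) % p])
    return models


if __name__ == "__main__":
    # 4 workers/partitions, 3 model configurations being selected over.
    partitions = [list(range(k * 10, (k + 1) * 10)) for k in range(4)]
    models = [ModelState(config_id=c) for c in range(3)]
    models = mop_epoch(models, partitions)
    for m in models:
        assert len(set(m.weights["visited"])) == len(partitions)
    print("Each model visited every partition exactly once this epoch.")
```

Because each model still consumes all of the data sequentially within an epoch, per-model SGD semantics (and hence reproducibility) can be preserved as in task parallelism, while the data itself remains partitioned in place as in data parallelism, which is the hybridization the abstract refers to.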
