Is Systematic Data Sharding able to Stabilize Asynchronous Parameter Server Training?

Over the last few years, deep learning has gained popularity in various domains, introducing complex models to handle the explosion of data. However, while such model architectures can exploit enormous amounts of data, a single computing node cannot train a model on the whole data set in a timely fashion. Thus, specialized distributed architectures have been proposed, most of which follow data parallelism schemes, such as the widely used parameter server approach. In this setup, each worker contributes to the training process asynchronously. While asynchronous training does not suffer from synchronization overheads, it introduces the problem of stale gradients, which may cause the model to diverge during training. In this paper, we examine different schemes for assigning data to workers that facilitate asynchronous learning. Specifically, we propose two algorithms for creating the data shards. Our experimental evaluation indicates that when stratification is taken into account, the validation results exhibit up to 6X less variance compared to standard shard creation. When the data is further explored for hidden stratification, validation metrics improve slightly, and the variance of training and validation metrics is reduced by up to 8X and 2X, respectively.
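To make the sharding idea concrete, the sketch below illustrates one way class-stratified shards can be built, assuming a labeled data set: samples are grouped by label and dealt round-robin across workers, so every shard approximately preserves the global class distribution. The function name stratified_shards and its parameters are illustrative assumptions, not code from the paper; a hidden-stratification variant would first cluster the samples (e.g., with k-means on their features) and stratify over the resulting clusters instead of the raw labels.

from collections import defaultdict
import random

def stratified_shards(labels, num_workers, seed=0):
    """Assign sample indices to worker shards so that each shard
    approximately preserves the global class distribution.
    (Illustrative sketch; not the paper's implementation.)"""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's samples round-robin across the shards.
    shards = [[] for _ in range(num_workers)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            shards[position % num_workers].append(idx)

    # Shuffle within each shard so workers do not see label-ordered data.
    for shard in shards:
        rng.shuffle(shard)
    return shards

# Example: ten samples from two classes split across two workers.
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(stratified_shards(labels, num_workers=2))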
