Weighted Aggregating Stochastic Gradient Descent for Parallel Deep Learning

This paper investigates the stochastic optimization problem with a focus on developing scalable parallel algorithms for deep learning. Our solution involves a reformulation of the objective function for stochastic optimization in neural network models, along with a novel parallel strategy, coined weighted aggregating stochastic gradient descent (WASGD). Following a theoretical analysis of the characteristics of the new objective function, WASGD introduces a decentralized weighted aggregating scheme based on the performance of local workers. Without any center variable, the new method automatically assesses the importance of each local worker and accepts its update according to its contribution. Furthermore, we develop an enhanced version of the method, WASGD+, by (1) imposing a designed sample order and (2) applying a more advanced weight-evaluation function. To validate the new method, we benchmark our schemes against several popular algorithms, including state-of-the-art techniques such as elastic averaging SGD, in training deep neural networks for classification tasks. Comprehensive experiments are conducted on four classic datasets: CIFAR-100, CIFAR-10, Fashion-MNIST, and MNIST. The results demonstrate the superiority of the WASGD scheme in accelerating the training of deep architectures; better still, the enhanced version, WASGD+, significantly improves upon its basic version.
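The abstract only sketches the aggregation idea, so the following is a minimal illustrative example of a decentralized, performance-weighted parameter average. The softmax-over-negative-local-loss weighting and the `temperature` parameter are assumptions chosen for illustration; they stand in for, and are not claimed to be, the paper's actual weight-evaluation function.

```python
import numpy as np

def weighted_aggregate(worker_params, worker_losses, temperature=1.0):
    """Combine local workers' parameter vectors without a center variable.

    Each worker's weight is derived from its local loss: a lower loss yields
    a larger share in the aggregate. The softmax-over-negative-loss form used
    here is an illustrative assumption, not the paper's exact weight function.
    """
    losses = np.asarray(worker_losses, dtype=float)
    # Lower loss -> higher score; temperature controls how sharply
    # better-performing workers dominate the aggregate.
    scores = -losses / temperature
    scores -= scores.max()                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted average of the workers' parameter vectors.
    stacked = np.stack(worker_params)           # shape: (n_workers, n_params)
    return weights @ stacked, weights

# Example: three workers after a local SGD step, aggregated by performance.
params = [np.array([0.9, 1.1]), np.array([1.0, 1.0]), np.array([1.4, 0.6])]
losses = [0.35, 0.30, 0.80]
aggregated, w = weighted_aggregate(params, losses)
print("weights:", np.round(w, 3), "aggregated:", np.round(aggregated, 3))
```

In this sketch, each worker would run its own SGD steps locally and periodically pull the aggregated parameters, so no central parameter server (center variable) is required; the weighting simply lets better-performing workers contribute more to the shared state.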
