Gear Training: A new way to implement high-performance model-parallel training

Training deep neural networks usually requires tremendous computing resources, so many deep models are trained on large clusters rather than on a single machine or GPU. While most current research runs the whole model on every machine and coordinates them with asynchronous stochastic gradient descent (ASGD), we present a new approach to training deep models in parallel: split the model and train its different parts separately, at different speeds.
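
Below is a minimal, single-process sketch of this idea, not the paper's actual distributed implementation. It assumes PyTorch, a toy two-part model, and a hypothetical "gear ratio" k: the slow front partition is updated only once every k steps, while the fast back partition is updated on every step.

    # Sketch only: illustrates updating model partitions at different speeds.
    import torch
    import torch.nn as nn

    front = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # slow-gear partition
    back = nn.Sequential(nn.Linear(64, 10))               # fast-gear partition

    opt_front = torch.optim.SGD(front.parameters(), lr=0.01)
    opt_back = torch.optim.SGD(back.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    k = 4  # hypothetical gear ratio: front steps once per k back steps

    for step in range(100):
        x = torch.randn(16, 32)          # toy input batch
        y = torch.randint(0, 10, (16,))  # toy labels

        logits = back(front(x))
        loss = loss_fn(logits, y)

        opt_front.zero_grad()
        opt_back.zero_grad()
        loss.backward()

        opt_back.step()                  # fast partition: updated every step
        if step % k == 0:
            opt_front.step()             # slow partition: updated every k-th step

In the cluster setting the paper targets, each partition would instead live on a different machine, exchanging activations and gradients over the network rather than within one process.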
