An Optimized Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks

Abstract: Recurrent neural networks (RNNs) have been successfully applied to various sequential decision-making tasks, natural language processing applications, and time-series predictions. Such networks are usually trained through back-propagation through time (BPTT), which is prohibitively expensive, especially as the length of the time dependencies and the number of hidden neurons increase. To reduce training time, extreme learning machines (ELMs) have recently been applied to RNN training, reaching a 99% speedup on some applications. Due to its non-iterative nature, ELM training, when parallelized, has the potential to reach higher speedups than BPTT. In this work, we present Opt-PR-ELM, an optimized parallel RNN training algorithm based on ELM that takes advantage of GPU shared memory and parallel QR factorization algorithms to efficiently reach optimal solutions. A theoretical analysis of the proposed algorithm is presented for six RNN architectures, including LSTM and GRU, and its performance is empirically tested on ten time-series prediction applications. Opt-PR-ELM is shown to reach up to a 461x speedup over its sequential counterpart and to require up to 20x less time to train than parallel BPTT. Such high speedups over new-generation CPUs are crucial in real-time applications and IoT environments.
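
The core idea behind ELM-based RNN training is non-iterative: the input and recurrent weights are drawn at random and kept fixed, and only the output weights are obtained in closed form from a least-squares problem, which Opt-PR-ELM solves via QR factorization on the GPU. The following is a minimal, CPU-only NumPy sketch of that idea under stated assumptions, not the paper's parallel implementation; the function name elm_train_rnn, the hidden size n_hidden, and the ridge term reg are illustrative choices.

```python
import numpy as np

def elm_train_rnn(X, y, n_hidden=64, seed=0, reg=1e-6):
    """Illustrative non-iterative (ELM-style) training of a simple Elman RNN.

    X: (T, d) input sequence, y: (T,) targets.
    The hidden-layer weights are random and fixed; only the output weights
    are computed, in closed form, by QR-based regularized least squares.
    """
    rng = np.random.default_rng(seed)
    T, d = X.shape
    # Random, untrained input-to-hidden and hidden-to-hidden weights.
    W_in = rng.standard_normal((n_hidden, d))
    W_h = rng.standard_normal((n_hidden, n_hidden)) / np.sqrt(n_hidden)

    # Run the recurrence once to collect hidden states H of shape (T, n_hidden).
    H = np.zeros((T, n_hidden))
    h = np.zeros(n_hidden)
    for t in range(T):
        h = np.tanh(W_in @ X[t] + W_h @ h)
        H[t] = h

    # Closed-form output weights: minimize ||H beta - y||^2 + reg * ||beta||^2
    # by QR-factorizing the ridge-augmented design matrix (no gradient steps).
    A = np.vstack([H, np.sqrt(reg) * np.eye(n_hidden)])
    b = np.concatenate([y, np.zeros(n_hidden)])
    Q, R = np.linalg.qr(A)
    beta = np.linalg.solve(R, Q.T @ b)
    return W_in, W_h, beta

if __name__ == "__main__":
    # Toy usage: predict the next value of a noisy sine wave.
    t = np.linspace(0, 20, 400)
    series = np.sin(t) + 0.05 * np.random.default_rng(1).standard_normal(t.size)
    X, y = series[:-1, None], series[1:]
    W_in, W_h, beta = elm_train_rnn(X, y)
```

Solving the least-squares step through QR rather than an explicit matrix inverse is numerically stabler and maps naturally onto the parallel QR factorization routines the abstract refers to.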
