Training Long Short-Term Memory With Sparsified Stochastic Gradient Descent

Prior work has demonstrated that exploiting sparsity can dramatically improve the energy efficiency and shrink the memory footprint of Convolutional Neural Networks (CNNs). However, these sparsity-centric optimization techniques might be less effective for Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs), especially in the training phase, because of the significant structural differences between their neurons. To investigate whether sparsity-centric optimization is possible for training LSTM-based RNNs, we studied several applications and observed that there is potential sparsity in the gradients generated during backward propagation. In this paper, we explain why this sparsity exists and propose a simple yet effective thresholding technique to induce more sparsity during LSTM-based RNN training. Experimental results show that the proposed technique can increase the sparsity of the linear gate gradients to higher than 80\% without loss of performance, which makes more than 50\% of the multiply-accumulate (MAC) operations redundant in an entire LSTM training process. These redundant MAC operations can be eliminated by hardware techniques to improve the energy efficiency and training speed of LSTM-based RNNs.
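
To make the thresholding idea concrete, the following is a minimal sketch of how small linear gate gradients could be zeroed out during an LSTM backward step, assuming a fixed magnitude threshold; the function name, tensor shapes, and threshold value are illustrative and not taken from the paper.

```python
import numpy as np

def sparsify_gate_gradients(d_gates, threshold):
    """Zero out linear gate gradient entries whose magnitude falls below
    `threshold`. The zeroed entries make the corresponding MAC operations
    in the weight-gradient computation redundant.
    (Hypothetical rule: a fixed magnitude threshold; the paper's exact
    thresholding criterion may differ.)"""
    mask = np.abs(d_gates) >= threshold
    # Return the sparsified gradients and the achieved sparsity ratio.
    return d_gates * mask, 1.0 - mask.mean()

# Toy backward step for one LSTM cell (names and sizes are illustrative):
# d_gates has shape (batch, 4 * hidden) covering the input, forget,
# cell, and output gates.
rng = np.random.default_rng(0)
d_gates = rng.normal(scale=0.01, size=(32, 4 * 128))
x_and_h = rng.normal(size=(32, 64 + 128))  # concatenated input and previous hidden state

d_gates_sparse, sparsity = sparsify_gate_gradients(d_gates, threshold=0.01)
dW = x_and_h.T @ d_gates_sparse  # weight gradients: zero entries let hardware skip MACs
print(f"gate-gradient sparsity: {sparsity:.2%}")
```

In this sketch, the sparsified gate gradients feed directly into the weight-gradient matrix multiplication, so every zeroed entry removes a full row of MAC operations that sparsity-aware hardware could skip.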