Efficient GPU implementation of convolutional neural networks for speech recognition

Deep learning has enjoyed tremendous success in speech recognition in recent years. Despite their widespread use, training large and complex architectures remains very time consuming. A prime example is convolutional neural networks (CNNs), which have provided state-of-the-art results but are also among the most computationally intensive networks to train. In this paper, we study four different methods for GPU acceleration of CNNs: a native implementation using cuBLAS, two implementations based on NVIDIA's recently released cuDNN deep-learning library, and an implementation based on cuFFT. We analyze the performance of each of these approaches on the forward operation, the gradient computation, and the backward propagation. The overall best performance is obtained with the custom native implementation, which was found to be up to 6.9 times faster than cuDNN. The paper concludes with results on the end-to-end training speed of our CNN on an LVCSR task.
