Recurrent convolutional neural network for speech processing

Different neural networks have exhibited excellent performance on various speech processing tasks, and they usually have specific advantages and disadvantages. We propose to use a recently developed deep learning model, recurrent convolutional neural network (RCNN), for speech processing, which inherits some merits of recurrent neural network (RNN) and convolutional neural network (CNN). The core module can be viewed as a convolutional layer embedded with an RNN, which enables the model to capture both temporal and frequency dependance in the spectrogram of the speech in an efficient way. The model is tested on speech corpus TIMIT for phoneme recognition and IEMOCAP for emotion recognition. Experimental results show that the model is competitive with previous methods in terms of accuracy and efficiency.

[1]  Xiaolin Hu,et al.  Recurrent convolutional neural network for object recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[5]  T. Munich,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[6]  Carlos Busso,et al.  IEMOCAP: interactive emotional dyadic motion capture database , 2008, Lang. Resour. Evaluation.

[7]  Andreas Krause,et al.  Advances in Neural Information Processing Systems (NIPS) , 2014 .

[8]  László Tóth Convolutional deep rectifier neural nets for phone recognition , 2013, INTERSPEECH.

[9]  Dong Yu,et al.  Exploring convolutional neural network structures and optimization techniques for speech recognition , 2013, INTERSPEECH.

[10]  Hsiao-Wuen Hon,et al.  Speaker-independent phone recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[11]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Martin Karafiát,et al.  Convolutive Bottleneck Network features for LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[13]  Jürgen Schmidhuber,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[14]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[15]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Jun Zhao,et al.  Recurrent Convolutional Neural Networks for Text Classification , 2015, AAAI.

[18]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[19]  Anthony J. Robinson,et al.  An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[20]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21]  Geoffrey Zweig,et al.  An introduction to computational networks and the computational network toolkit (invited talk) , 2014, INTERSPEECH.

[22]  Tara N. Sainath,et al.  Improvements to Deep Convolutional Neural Networks for LVCSR , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[23]  Yu Zhang,et al.  Highway long short-term memory RNNS for distant speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[25]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[26]  László Tóth,et al.  Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Dong Yu,et al.  Speech emotion recognition using deep neural network and extreme learning machine , 2014, INTERSPEECH.

[28]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[29]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[30]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[31]  Y. X. Zou,et al.  An experimental study of speech emotion recognition based on deep convolutional neural networks , 2015, 2015 International Conference on Affective Computing and Intelligent Interaction (ACII).

[32]  Xiaolin Hu,et al.  Convolutional Neural Networks with Intra-Layer Recurrent Connections for Scene Labeling , 2015, NIPS.

[33]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .