Towards End-to-End Speech Recognition with Deep Multipath Convolutional Neural Networks

Approaches to deep learning have been used all over in connection to Automatic Speech Recognition (ASR), where they have achieved a high level of accuracy. This has mostly been seen in Convolutional Neural Network (CNN) which has recently been investigated in ASR. Due to the fact that CNN has an increased network’s depth on one branch, and may not be wide enough to work on capturing adequate features on signals of human speech. We focus on a proposal for an architecture that is deep and wide in CNN referred to as Multipath Convolutional Neural Network (MCNN). MCNN-CTC combines three additional paths with Connectionist Temporal Classification (CTC) objective function, and can be defined as an end-to-end system that has the ability to fully exploit spectral and temporal structures related to speech signals simultaneously. Results from the experiments show that the newly proposed MCNN-CTC structure enables a reduction in the error rate arising from the construction of end-to-end acoustic model. In the absence of a Language Model (LM), our proposed MCNN-CTC acoustic model has a relative reduction of 1.10%–12.08% comparing to the traditional HMM-based or DCNN-CTC-based models with strong generalization performance.

[1]  Jie Li,et al.  Towards end-to-end speech recognition for Chinese Mandarin using long short-term memory recurrent neural networks , 2015, INTERSPEECH.

[2]  Shuang Xu,et al.  Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Shuang Xu,et al.  Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese , 2018, INTERSPEECH.

[4]  Daniel Jurafsky,et al.  First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs , 2014, ArXiv.

[5]  Hu Hu,et al.  Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[6]  Yajie Miao,et al.  EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[7]  Ying Zhang,et al.  Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks , 2016, INTERSPEECH.

[8]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[9]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[10]  Tara N. Sainath,et al.  Deep convolutional neural networks for LVCSR , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Gerald Penn,et al.  Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[13]  Xiangang Li,et al.  Comparable Study Of Modeling Units For End-To-End Mandarin Speech Recognition , 2018, 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP).

[14]  Gerald Penn,et al.  Convolutional Neural Networks for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  Hui Zhang,et al.  Mongolian Speech Recognition Based on Deep Neural Networks , 2015, CCL.

[16]  Dong Wang,et al.  THCHS-30 : A Free Chinese Speech Corpus , 2015, ArXiv.

[17]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[18]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[21]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[22]  Philip C. Woodland,et al.  Very deep convolutional neural networks for robust speech recognition , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[23]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.