Fully Convolutional Speech Recognition

Current state-of-the-art speech recognition systems rely on recurrent neural networks for acoustic and/or language modeling, and on feature extraction pipelines that compute mel-filterbank or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic modeling from the raw waveform and in language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words. On Wall Street Journal, our model matches the current state-of-the-art. On Librispeech, we report state-of-the-art performance among end-to-end models, including Deep Speech 2, which was trained with 12 times more acoustic data and significantly more linguistic data.
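To make the overall pipeline concrete, the sketch below is a minimal illustration, not the paper's exact architecture or hyperparameters: it assumes PyTorch and shows a learnable convolutional front-end applied directly to the raw waveform, followed by a stack of gated convolutions that output per-frame character logits. The class names (FullyConvASR, ConvGLUBlock), layer sizes, kernel widths, and strides are all hypothetical choices for illustration.

```python
# Minimal sketch (illustrative layer sizes, not the paper's configuration):
# a fully convolutional acoustic model mapping a raw waveform to character logits.
import torch
import torch.nn as nn


class ConvGLUBlock(nn.Module):
    """1-D convolution followed by a gated linear unit (GLU) and dropout."""

    def __init__(self, in_ch, out_ch, kernel, stride=1, dropout=0.2):
        super().__init__()
        # 2 * out_ch channels so the GLU can split them into value and gate halves.
        self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel, stride=stride,
                              padding=kernel // 2)
        self.glu = nn.GLU(dim=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.glu(self.conv(x)))


class FullyConvASR(nn.Module):
    """Learnable convolutional front-end + convolutional acoustic model."""

    def __init__(self, num_chars=29):
        super().__init__()
        # Front-end: strided convolutions play the role of the filterbank,
        # reducing the 16 kHz waveform to roughly a 10 ms frame rate.
        self.frontend = nn.Sequential(
            ConvGLUBlock(1, 64, kernel=80, stride=16),
            ConvGLUBlock(64, 128, kernel=7, stride=10),
        )
        # Acoustic model: a stack of gated convolutions over the frames.
        self.acoustic = nn.Sequential(
            ConvGLUBlock(128, 256, kernel=13),
            ConvGLUBlock(256, 256, kernel=13),
            ConvGLUBlock(256, 512, kernel=13),
        )
        # 1x1 convolution projecting to character logits
        # (e.g. letters plus a few auxiliary tokens; 29 is an arbitrary example).
        self.head = nn.Conv1d(512, num_chars, kernel_size=1)

    def forward(self, wav):
        # wav: (batch, samples) raw audio -> (batch, frames, num_chars) logits
        x = wav.unsqueeze(1)          # add a channel dimension
        x = self.frontend(x)
        x = self.acoustic(x)
        return self.head(x).transpose(1, 2)


if __name__ == "__main__":
    model = FullyConvASR()
    wav = torch.randn(2, 16000)       # batch of two, one second of fake 16 kHz audio
    logits = model(wav)
    print(logits.shape)               # (2, frames, 29)
    # Training would apply a sequence criterion (e.g. CTC) to these logits,
    # and decoding would combine them with an external convolutional language model.
```

In this sketch the strided front-end replaces the mel-filterbank computation, which is the sense in which the feature extraction step is removed; the character-level output and the external convolutional language model for word decoding follow the description in the abstract.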
