E2E-SINCNET: Toward Fully End-To-End Speech Recognition

Modern end-to-end (E2E) Automatic Speech Recognition (ASR) systems rely on Deep Neural Networks (DNNs) that are mostly trained on handcrafted, pre-computed acoustic features such as Mel filter banks or Mel-frequency cepstral coefficients. Nonetheless, despite their lower performance, E2E ASR models that process raw waveforms remain an active research field because the input signal is lossless. In this paper, we propose E2E-SincNet, a novel fully E2E ASR model that goes from the raw waveform to the text transcript by merging two recent and powerful paradigms: SincNet and the joint CTC-attention training scheme. Experiments conducted on two different speech recognition tasks show that our approach outperforms previously investigated E2E systems relying on either the raw waveform or pre-computed acoustic features, with a reported top-of-the-line Word Error Rate (WER) of 4.7% on the Wall Street Journal (WSJ) dataset.
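As a rough illustration of the two paradigms named above, the sketch below builds one SincNet-style band-pass filter from two cutoff frequencies and combines a CTC loss with an attention loss as a weighted sum. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the function names, the Hamming window, the kernel size, the sample rate and the default weight of 0.3 are all illustrative choices, and in a real model the two cutoffs would be learned parameters rather than fixed arguments.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Return a windowed band-pass FIR kernel defined only by two cutoffs (Hz).

    Hypothetical helper: the kernel is the difference of two ideal low-pass
    filters, smoothed with a Hamming window (an assumed window choice).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    if not 0 <= f_low < f_high <= sample_rate / 2:
        raise ValueError("need 0 <= f_low < f_high <= Nyquist frequency")

    half = kernel_size // 2
    f1 = f_low / sample_rate   # normalized low cutoff (cycles per sample)
    f2 = f_high / sample_rate  # normalized high cutoff (cycles per sample)

    kernel = []
    for i in range(kernel_size):
        n = i - half  # centered time index
        if n == 0:
            ideal = 2.0 * (f2 - f1)  # limit of the expression below at n = 0
        else:
            ideal = (math.sin(2 * math.pi * f2 * n)
                     - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the two objectives; ctc_weight is an assumed example value."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


if __name__ == "__main__":
    # A filter covering roughly the telephone band, for illustration only.
    kernel = sinc_bandpass(300.0, 3400.0)
    print(len(kernel), kernel[len(kernel) // 2])
    print(joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0))
```

Each filter in this sketch is described by only two numbers, whatever the kernel length, which is the kind of compact, constrained first layer that makes learning directly from the raw waveform more tractable than learning every filter tap freely.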
