Acoustic-to-Word Attention-Based Model Complemented with Character-Level CTC-Based Model

This paper addresses end-to-end speech recognition that directly maps acoustic features to a word sequence. The acoustic-to-word model is attractive because it requires neither an external language model nor an elaborate decoder, resulting in extremely simple and fast decoding. Its apparent drawback is the sparseness of training data, particularly for less frequent words. In this paper, we propose a framework in which the word-level model is complemented with a character-level model. Joint training of the word-level model with the character-level model makes the learned feature extraction and classification processes more general and prevents overfitting. Moreover, the character-level model is used to decode out-of-vocabulary (OOV) words that are not covered by the word-level model. Since both connectionist temporal classification (CTC) and attention-based models are available for end-to-end recognition, we also explore the optimal combination for the hybrid system. Evaluations on the Corpus of Spontaneous Japanese (CSJ) show that (1) the acoustic-to-word attention-based model outperforms CTC, (2) multitask learning (MTL) with a character-level CTC model is effective, and (3) the hybrid system achieves accuracy comparable to or even better than that of the standard DNN-HMM system, with decoding faster by a factor of 25.
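As a rough illustration of the multitask learning mentioned above, the sketch below combines a word-level attention loss and a character-level CTC loss into a single training objective. The abstract does not state how the two losses are combined, so the linear interpolation, the function name `multitask_loss`, and the parameter `ctc_weight` are assumptions made for illustration only, not the authors' formulation.

```python
def multitask_loss(word_attention_loss, char_ctc_loss, ctc_weight=0.5):
    """Interpolate two per-batch losses into one training objective.

    Assumption: a simple linear interpolation. The weighting scheme and
    the default value of ctc_weight are hypothetical, not from the paper.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # The word-level attention loss is the main task; the character-level
    # CTC loss acts as an auxiliary task on the shared encoder.
    return (1.0 - ctc_weight) * word_attention_loss + ctc_weight * char_ctc_loss


# Usage: loss values here are made-up scalars for one mini-batch.
combined = multitask_loss(word_attention_loss=2.0, char_ctc_loss=4.0, ctc_weight=0.25)
```

Setting `ctc_weight` to 0 in this sketch recovers training on the word-level loss alone, which gives a convenient baseline when comparing against the jointly trained model.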
