Experimenting with Attention Mechanisms in Joint CTC-Attention Models for Russian Speech Recognition

The paper presents an investigation of attention mechanisms in an end-to-end Russian speech recognition system built by joining a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. We trained the models on a small dataset of Russian speech with a total duration of about 60 hours and pretrained them using transfer learning with English as the non-target language. We experimented with the following types of attention mechanisms: coverage-based attention and 2D location-aware attention, as well as their combination. At the decoding stage, we used a beam search pruning method and the Gumbel-Softmax function instead of softmax. We achieved a 4% relative word error rate reduction using 2D location-aware attention.
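Since the abstract mentions replacing softmax with the Gumbel-Softmax function during decoding, the sketch below illustrates the standard Gumbel-Softmax relaxation (Jang et al., "Categorical Reparameterization with Gumbel-Softmax"); it is a minimal NumPy illustration under assumed settings (e.g. the temperature value), not the authors' actual decoding code.

    import numpy as np

    def gumbel_softmax(logits, temperature=1.0, rng=None):
        # logits: unnormalized log-probabilities over output tokens
        # temperature: softness of the relaxation (lower -> closer to argmax)
        rng = np.random.default_rng() if rng is None else rng
        # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
        gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
        scaled = (logits + gumbel_noise) / temperature
        # numerically stable softmax over the perturbed, scaled logits
        scaled -= scaled.max()
        exps = np.exp(scaled)
        return exps / exps.sum()

    # hypothetical usage with decoder output scores
    decoder_logits = np.array([2.0, 0.5, -1.0])
    token_probs = gumbel_softmax(decoder_logits, temperature=0.5)

In a joint CTC-attention model of the kind described, this function would be applied to the decoder's output scores at each step; the added Gumbel noise and temperature make hypothesis scoring in beam search less peaked than a plain softmax.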
