Paper information - Experimenting with Attention Mechanisms in Joint CTC-Attention Models for Russian Speech Recognition

The paper investigates attention mechanisms in an end-to-end Russian speech recognition system built by joining a Connectionist Temporal Classification (CTC) model with an attention-based encoder-decoder. We trained the models on a small dataset of Russian speech with a total duration of about 60 h, and pretrained them via transfer learning, using English as the non-target language. We experimented with the following types of attention mechanism: coverage-based attention, 2D location-aware attention, and their combination. At the decoding stage we used a beam search pruning method and the Gumbel-softmax function instead of softmax. We achieved a 4% relative word error rate reduction using 2D location-aware attention.
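As a rough illustration of the Gumbel-softmax function mentioned above, the following minimal NumPy sketch perturbs logits with Gumbel(0, 1) noise and then applies a temperature-scaled softmax. The function names, the `temperature` parameter, and the example logits are assumptions made for illustration; the abstract does not say how the authors integrate the function into their decoder or which temperature they use.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Softmax over logits perturbed by Gumbel(0, 1) noise.

    `temperature` is a hypothetical knob for this sketch: lower values
    push the output closer to a one-hot vector.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) sample: g = -log(-log(u)) with u ~ Uniform(0, 1).
    # The lower bound keeps log(u) finite.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    return softmax((logits + gumbel_noise) / temperature)


if __name__ == "__main__":
    # Illustrative decoder scores for two time steps over three tokens.
    scores = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]])
    print(gumbel_softmax(scores, temperature=0.5, rng=np.random.default_rng(0)))
```

Each output row is a valid probability distribution over tokens, so in principle it could be used wherever a decoder would otherwise use plain softmax scores.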

Irina S. Kipyatkova | Nikita Markovnikov
