Speaker-Aware Training of Attention-Based End-to-End Speech Recognition Using Neural Speaker Embeddings

In speaker-aware training, a speaker embedding is appended to the DNN input features, allowing the DNN to learn representations that are robust to speaker variability. We apply speaker-aware training to attention-based end-to-end speech recognition and show that it can improve over a purely end-to-end baseline. We also propose speaker-aware training as a viable method to leverage untranscribed, speaker-annotated data. We apply state-of-the-art embedding approaches, both i-vectors and neural embeddings such as x-vectors, and experiment with embeddings trained in two conditions: on the fixed ASR data and on a large untranscribed dataset. We run our experiments on the TED-LIUM and Wall Street Journal datasets. No single embedding consistently outperforms all others, but in many settings neural embeddings outperform i-vectors.
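As an illustration of the feature-level concatenation described above (a minimal sketch, not the authors' implementation; the function name, feature dimensions, and random inputs are assumptions for demonstration only):

```python
import numpy as np

def append_speaker_embedding(features: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Concatenate a fixed utterance-level speaker embedding onto every frame.

    features:  (num_frames, feat_dim) acoustic features, e.g. 80-dim filterbanks
    embedding: (emb_dim,) speaker embedding, e.g. an i-vector or x-vector
    returns:   (num_frames, feat_dim + emb_dim) speaker-aware input features
    """
    num_frames = features.shape[0]
    tiled = np.tile(embedding[np.newaxis, :], (num_frames, 1))  # repeat the embedding per frame
    return np.concatenate([features, tiled], axis=1)

# Example: 200 frames of 80-dim filterbanks plus a 512-dim x-vector -> shape (200, 592)
feats = np.random.randn(200, 80).astype(np.float32)
xvec = np.random.randn(512).astype(np.float32)
augmented = append_speaker_embedding(feats, xvec)
print(augmented.shape)
```

In this sketch the augmented features would then be fed to the attention-based encoder in place of the plain acoustic features, so the network can condition on the speaker identity during training and decoding.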
