Speaker-Aware Training of Attention-Based End-to-End Speech Recognition Using Neural Speaker Embeddings

In speaker-aware training, a speaker embedding is appended to the DNN input features, allowing the DNN to learn representations that are robust to speaker variability. We apply speaker-aware training to attention-based end-to-end speech recognition and show that it can improve over a purely end-to-end baseline. We also propose speaker-aware training as a viable method to leverage untranscribed, speaker-annotated data. We apply state-of-the-art embedding approaches, both i-vectors and neural embeddings such as x-vectors, and experiment with embeddings trained in two conditions: on the fixed ASR data and on a large untranscribed dataset. We run our experiments on the TED-LIUM and Wall Street Journal datasets. No single embedding consistently outperforms all others, but in many settings neural embeddings outperform i-vectors.
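As an illustration of the feature-level concatenation described above (a minimal sketch, not the authors' implementation; the function name, feature dimensions, and random inputs are assumptions for demonstration only):

```python
import numpy as np

def append_speaker_embedding(features: np.ndarray, embedding: np.ndarray) -> np.ndarray:
    """Concatenate a fixed utterance-level speaker embedding onto every frame.

    features:  (num_frames, feat_dim) acoustic features, e.g. 80-dim filterbanks
    embedding: (emb_dim,) speaker embedding, e.g. an i-vector or x-vector
    returns:   (num_frames, feat_dim + emb_dim) speaker-aware input features
    """
    num_frames = features.shape[0]
    tiled = np.tile(embedding[np.newaxis, :], (num_frames, 1))  # repeat the embedding per frame
    return np.concatenate([features, tiled], axis=1)

# Example: 200 frames of 80-dim filterbanks plus a 512-dim x-vector -> shape (200, 592)
feats = np.random.randn(200, 80).astype(np.float32)
xvec = np.random.randn(512).astype(np.float32)
augmented = append_speaker_embedding(feats, xvec)
print(augmented.shape)
```

In this sketch the augmented features would then be fed to the attention-based encoder in place of the plain acoustic features, so the network can condition on the speaker identity during training and decoding.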
