End-To-End Multi-Speaker Speech Recognition With Transformer

Recently, end-to-end models based on recurrent neural networks (RNNs) have proven effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks, focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, to make the Transformer usable in the masking network of the neural beamformer in the multi-channel case, we restrict the self-attention component to a local segment rather than the whole sequence, reducing computation. In addition to these architectural changes, we incorporate an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling the model to handle reverberant signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reductions, down to 12.1% and 6.4% WER, under the anechoic condition in the single-channel and multi-channel tasks, respectively, while in the reverberant case our methods achieve 41.5% and 13.8% relative WER reductions, down to 16.5% and 15.2% WER.
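
The segment-restricted self-attention used in the masking network can be illustrated with a small sketch. Below is a minimal single-head NumPy version, assuming hypothetical parameter names and a symmetric window of `left`/`right` frames around each query; the paper's actual network uses multi-head attention inside the neural beamformer, and an efficient implementation would compute only the banded scores rather than the full T×T matrix built here for clarity.

```python
import numpy as np

def restricted_self_attention(x, w_q, w_k, w_v, left=15, right=15):
    """Single-head self-attention restricted to a local time segment.

    x: (T, d_model) input frames; w_q, w_k, w_v: (d_model, d_k) projections.
    Each frame attends only to frames within [t - left, t + right], so the
    effective attention cost grows linearly with T instead of quadratically.
    (For clarity this sketch still materializes the full (T, T) score matrix
    and masks it; a real implementation would compute only the band.)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # (T, d_k) each
    scores = q @ k.T / np.sqrt(k.shape[-1])             # (T, T) raw scores
    t = np.arange(x.shape[0])
    offset = t[None, :] - t[:, None]                    # key index minus query index
    # Mask out positions outside the local segment around each query frame.
    scores = np.where((offset < -left) | (offset > right), -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v                                  # (T, d_k)

# Usage example with random frames (shapes chosen arbitrarily):
rng = np.random.default_rng(0)
T, d_model, d_k = 100, 64, 64
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
out = restricted_self_attention(x, w_q, w_k, w_v, left=15, right=15)  # (100, 64)
```

The window size trades off context against cost: a wider segment lets the mask estimator see more temporal context but increases the per-frame attention computation, so in practice it is tuned to the shortest span that preserves separation quality.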
