The STC System for the CHiME-6 Challenge

This paper describes the Speech Technology Center (STC) systems for the CHiME-6 Challenge, which targets multi-microphone, multi-speaker speech recognition and diarization in a dinner-party scenario. We participated in both Track 1 and Track 2 and submitted results for Ranking A as well as Ranking B in each track. The Track 1 system combined soft-activity-based Guided Source Separation (GSS) as a front-end with advanced acoustic modeling techniques, namely GSS-based training data augmentation, multi-stride and multi-stream self-attention layers, a statistics layer, and SpecAugment, as well as lattice-level fusion of acoustic models. This system placed among the top three, achieving a 30% relative WER reduction over the baseline; with lattice rescoring by a neural language model added for Ranking B, the overall reduction reached 34% relative WER over the baseline in Track 1. For Track 2, we proposed a novel Target-Speaker Voice Activity Detection (TS-VAD) approach to the diarization problem. The resulting accurate diarization made it possible to apply GSS to the obtained segments. TS-VAD is based on i-vector speaker embeddings, which are initially estimated by a strong diarization system built on spectral clustering of x-vectors. The back-end from the Track 1 system was reused in Track 2. The Track 2 system demonstrated state-of-the-art performance, outperforming the baseline by 39% DER, 45% JER, 43% WER (Ranking A), and 45% WER (Ranking B) relative.
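
Of the acoustic-model techniques listed above, SpecAugment is the simplest to illustrate: it zeroes out random frequency bands and time spans of the training features. Below is a minimal sketch, assuming a (time, frequency) feature matrix; the mask counts and widths are illustrative, not the values used in the system.

```python
# Minimal SpecAugment-style masking on a (time, freq) feature matrix.
# Mask counts and widths are illustrative, not the system's settings.
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_f=8, num_time_masks=2, max_t=20):
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(num_freq_masks):
        f = np.random.randint(0, max_f + 1)          # mask width in bins
        f0 = np.random.randint(0, max(1, F - f))     # mask start
        spec[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = np.random.randint(0, max_t + 1)          # mask length in frames
        t0 = np.random.randint(0, max(1, T - t))
        spec[t0:t0 + t, :] = 0.0
    return spec

# Example: 500 frames of 40-dim log-Mel features (random placeholder).
augmented = spec_augment(np.random.randn(500, 40))
```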
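
The Track 2 pipeline starts from a diarization based on spectral clustering of x-vectors. A minimal sketch of that step, assuming segment-level x-vectors are already extracted and the number of speakers is known (four per CHiME-6 session), might look as follows; the affinity auto-tuning used by strong systems is omitted here.

```python
# Minimal sketch: spectral clustering of x-vectors for initial diarization.
# Assumes `xvectors` is an (N, D) array of per-segment speaker embeddings.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def cluster_xvectors(xvectors: np.ndarray, num_speakers: int = 4) -> np.ndarray:
    # Cosine similarity lies in [-1, 1]; shift to [0, 1] so it is a valid affinity.
    affinity = (cosine_similarity(xvectors) + 1.0) / 2.0
    labels = SpectralClustering(
        n_clusters=num_speakers,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    return labels  # one speaker label per segment

# Example: 100 random 256-dim embeddings standing in for real x-vectors.
labels = cluster_xvectors(np.random.randn(100, 256))
```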
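
TS-VAD itself predicts, for every frame, the speech activity of each known speaker given that speaker's i-vector. The PyTorch sketch below conveys the idea only: a shared per-speaker detector conditioned on the i-vector, followed by a joint layer over all speakers. All layer types and sizes are assumptions, not the architecture from the paper.

```python
# Illustrative TS-VAD-style model: per-speaker detection conditioned on
# i-vectors, then a joint layer over all speakers. Sizes are assumptions.
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    def __init__(self, feat_dim=40, ivector_dim=100, hidden=128, num_speakers=4):
        super().__init__()
        self.num_speakers = num_speakers
        # Shared detector: acoustic features concatenated with one i-vector.
        self.speaker_detector = nn.LSTM(
            feat_dim + ivector_dim, hidden, batch_first=True, bidirectional=True)
        # Joint layer sees all speakers' detector outputs at once.
        self.joint = nn.LSTM(
            2 * hidden * num_speakers, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim); ivectors: (B, num_speakers, ivector_dim)
        per_speaker = []
        for s in range(self.num_speakers):
            iv = ivectors[:, s:s + 1, :].expand(-1, feats.size(1), -1)
            out, _ = self.speaker_detector(torch.cat([feats, iv], dim=-1))
            per_speaker.append(out)
        joint_out, _ = self.joint(torch.cat(per_speaker, dim=-1))
        # Per-frame, per-speaker speech activity probabilities: (B, T, S).
        return torch.sigmoid(self.head(joint_out))

model = TSVADSketch()
probs = model(torch.randn(2, 200, 40), torch.randn(2, 4, 100))
```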
