The STC System for the CHiME-6 Challenge

This paper describes the Speech Technology Center (STC) systems for the CHiME-6 Challenge, which targets multi-microphone, multi-speaker speech recognition and diarization in a dinner-party scenario. We participated in both Track 1 and Track 2 and submitted results for Ranking A as well as Ranking B in each track. The Track 1 system combined soft-activity-based Guided Source Separation (GSS) as a front-end with advanced acoustic modeling techniques, namely GSS-based training data augmentation, multi-stride and multi-stream self-attention layers, a statistics layer, and SpecAugment, as well as lattice-level fusion of acoustic models. This system placed among the top three, achieving a 30% relative WER reduction over the baseline; with lattice rescoring by a neural language model added for Ranking B, the overall reduction reached 34% relative WER over the baseline in Track 1. For Track 2, we proposed a novel Target-Speaker Voice Activity Detection (TS-VAD) approach to the diarization problem. The resulting accurate diarization made it possible to apply GSS to the obtained segments. TS-VAD is based on i-vector speaker embeddings, which are initially estimated by a strong diarization system built on spectral clustering of x-vectors. The back-end from the Track 1 system was reused in Track 2. The Track 2 system demonstrated state-of-the-art performance, outperforming the baseline by 39% DER, 45% JER, 43% WER (Ranking A), and 45% WER (Ranking B) relative.
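
Of the acoustic-model techniques listed above, SpecAugment is the simplest to illustrate: it zeroes out random frequency bands and time spans of the training features. Below is a minimal sketch, assuming a (time, frequency) feature matrix; the mask counts and widths are illustrative, not the values used in the system.

```python
# Minimal SpecAugment-style masking on a (time, freq) feature matrix.
# Mask counts and widths are illustrative, not the system's settings.
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_f=8, num_time_masks=2, max_t=20):
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(num_freq_masks):
        f = np.random.randint(0, max_f + 1)          # mask width in bins
        f0 = np.random.randint(0, max(1, F - f))     # mask start
        spec[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = np.random.randint(0, max_t + 1)          # mask length in frames
        t0 = np.random.randint(0, max(1, T - t))
        spec[t0:t0 + t, :] = 0.0
    return spec

# Example: 500 frames of 40-dim log-Mel features (random placeholder).
augmented = spec_augment(np.random.randn(500, 40))
```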
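
The Track 2 pipeline starts from a diarization based on spectral clustering of x-vectors. A minimal sketch of that step, assuming segment-level x-vectors are already extracted and the number of speakers is known (four per CHiME-6 session), might look as follows; the affinity auto-tuning used by strong systems is omitted here.

```python
# Minimal sketch: spectral clustering of x-vectors for initial diarization.
# Assumes `xvectors` is an (N, D) array of per-segment speaker embeddings.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def cluster_xvectors(xvectors: np.ndarray, num_speakers: int = 4) -> np.ndarray:
    # Cosine similarity lies in [-1, 1]; shift to [0, 1] so it is a valid affinity.
    affinity = (cosine_similarity(xvectors) + 1.0) / 2.0
    labels = SpectralClustering(
        n_clusters=num_speakers,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    return labels  # one speaker label per segment

# Example: 100 random 256-dim embeddings standing in for real x-vectors.
labels = cluster_xvectors(np.random.randn(100, 256))
```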
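
TS-VAD itself predicts, for every frame, the speech activity of each known speaker given that speaker's i-vector. The PyTorch sketch below conveys the idea only: a shared per-speaker detector conditioned on the i-vector, followed by a joint layer over all speakers. All layer types and sizes are assumptions, not the architecture from the paper.

```python
# Illustrative TS-VAD-style model: per-speaker detection conditioned on
# i-vectors, then a joint layer over all speakers. Sizes are assumptions.
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    def __init__(self, feat_dim=40, ivector_dim=100, hidden=128, num_speakers=4):
        super().__init__()
        self.num_speakers = num_speakers
        # Shared detector: acoustic features concatenated with one i-vector.
        self.speaker_detector = nn.LSTM(
            feat_dim + ivector_dim, hidden, batch_first=True, bidirectional=True)
        # Joint layer sees all speakers' detector outputs at once.
        self.joint = nn.LSTM(
            2 * hidden * num_speakers, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_speakers)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim); ivectors: (B, num_speakers, ivector_dim)
        per_speaker = []
        for s in range(self.num_speakers):
            iv = ivectors[:, s:s + 1, :].expand(-1, feats.size(1), -1)
            out, _ = self.speaker_detector(torch.cat([feats, iv], dim=-1))
            per_speaker.append(out)
        joint_out, _ = self.joint(torch.cat(per_speaker, dim=-1))
        # Per-frame, per-speaker speech activity probabilities: (B, T, S).
        return torch.sigmoid(self.head(joint_out))

model = TSVADSketch()
probs = model(torch.randn(2, 200, 40), torch.randn(2, 4, 100))
```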
