DiaCorrect: End-to-end error correction for speaker diarization

In recent years, speaker diarization has attracted widespread attention. To achieve better performance, some studies propose to diarize speech in multiple stages. Although these methods can bring additional gains, most of them are quite complex. Motivated by spelling correction in automatic speech recognition (ASR), in this paper we propose an end-to-end error correction framework, termed DiaCorrect, that refines initial diarization results in a simple but efficient way. By exploiting the acoustic interactions between the input mixture and its corresponding speaker activity, DiaCorrect automatically adapts the initial speaker activity to minimize diarization errors. Without bells and whistles, experiments on LibriSpeech-based 2-speaker meeting-like data show that DiaCorrect reduces the diarization error rate (DER) of a self-attentive end-to-end neural diarization (SA-EEND) baseline by over 62.4% relative, from 12.31% to 4.63%. Our source code is available online at https://github.com/jyhan03/diacorrect.
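As a quick arithmetic check, the headline improvement is a *relative* DER reduction, i.e. the drop from the baseline DER to the corrected DER expressed as a fraction of the baseline. A minimal sketch using the two DER values quoted above:

```python
# Relative DER reduction reported for DiaCorrect:
# SA-EEND baseline DER 12.31% -> 4.63% after correction.
baseline_der = 12.31
corrected_der = 4.63

# Relative reduction = (baseline - corrected) / baseline
relative_reduction = (baseline_der - corrected_der) / baseline_der
print(f"{relative_reduction * 100:.1f}% relative DER reduction")  # 62.4%
```

This confirms the "over 62.4%" figure is a relative reduction, not an absolute one (the absolute drop is 7.68 percentage points).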
