Paper Information - Neural Diarization with Non-Autoregressive Intermediate Attractors

Neural Diarization with Non-Autoregressive Intermediate Attractors

End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method to handle the whole speaker diarization problem simultaneously with a single neural network. While the EEND model can produce all frame-level speaker labels simultaneously, it disregards output label dependency. In this work, we propose a novel EEND model that introduces the label dependency between frames. The proposed method generates non-autoregressive intermediate attractors to produce speaker labels at the lower layers and conditions the subsequent layers with these labels. While the proposed model works in a non-autoregressive manner, the speaker labels are refined by referring to the whole sequence of intermediate labels. The experiments with the two-speaker CALLHOME dataset show that the intermediate labels with the proposed non-autoregressive intermediate attractors boost the diarization performance. The proposed method with the deeper network benefits more from the intermediate labels, resulting in better performance and training throughput than EEND-EDA.
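The layer-wise idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract states only that intermediate attractors produce speaker labels at the lower layers and that subsequent layers are conditioned on those labels. Everything else here is an assumption made for illustration: the function names, the use of learnable queries with cross-attention to obtain attractors in one step, the simplified self-attention block standing in for a Transformer encoder layer, and the conditioning rule that adds label-weighted attractors back to the frame embeddings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(h, w):
    """Simplified stand-in for a Transformer encoder layer (assumption):
    single-head self-attention over all frames plus a residual connection."""
    d = h.shape[1]
    attn = softmax(h @ h.T / np.sqrt(d), axis=-1)  # (T, T)
    return h + np.tanh(attn @ h @ w)

def attractors_from_queries(h, queries):
    """Non-autoregressive attractors (assumed form): every speaker query
    attends over all frames at once, so no attractor depends on another."""
    d = h.shape[1]
    weights = softmax(queries @ h.T / np.sqrt(d), axis=-1)  # (S, T)
    return weights @ h  # (S, D)

def forward(frames, layer_weights, queries, cond_proj):
    """Return one (T, S) matrix of speaker-activity posteriors per layer."""
    all_labels = []
    h = frames
    last = len(layer_weights) - 1
    for i, w in enumerate(layer_weights):
        h = encoder_block(h, w)
        att = attractors_from_queries(h, queries)
        labels = sigmoid(h @ att.T)  # frame-level labels at this layer
        all_labels.append(labels)
        if i < last:
            # Hypothetical conditioning rule: feed the intermediate labels
            # back as label-weighted attractors for the next layer to refine.
            h = h + (labels @ att) @ cond_proj
    return all_labels

rng = np.random.default_rng(0)
T, D, S, L = 50, 16, 2, 4  # frames, embedding size, speakers, layers
frames = rng.standard_normal((T, D))
layer_weights = [0.1 * rng.standard_normal((D, D)) for _ in range(L)]
queries = rng.standard_normal((S, D))
cond_proj = 0.1 * rng.standard_normal((D, D))

outputs = forward(frames, layer_weights, queries, cond_proj)
print(len(outputs), outputs[-1].shape)
```

In a trained model, each entry of `outputs` would presumably receive its own diarization loss, with the final entry used at inference. Because the next self-attention block sees the conditioned embeddings of all frames, later layers can take the whole sequence of intermediate labels into account without any autoregressive decoding step.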

Tetsuji Ogawa | Yusuke Kida | Robin Scheibler | Tatsuya Komatsu | Yusuke Fujita
