Paper Information - Detecting and Counting Overlapping Speakers in Distant Speech Scenarios

We consider the problem of detecting the activity and counting overlapping speakers in distant-microphone recordings. We treat supervised Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD+OSD, and speaker counting as instances of a general Overlapped Speech Detection and Counting (OSDC) task, and we design a Temporal Convolutional Network (TCN) based method to address it. We show that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets. In particular, our best architecture obtains, for OSD, 29.1% and 25.5% absolute improvement in Average Precision over previous techniques on, respectively, the AMI and CHiME-6 datasets. Furthermore, we find that generalization for joint VAD+OSD improves by using a speaker counting objective rather than a VAD+OSD objective. We also study the effectiveness of forced alignment based labeling and data augmentation, and show that both can improve OSD performance.
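As an illustration of how VAD, OSD, joint VAD+OSD, and speaker counting can be read as instances of one task, the sketch below derives all four frame-level label sequences from a single per-frame speaker count. It is a minimal sketch, not the authors' pipeline: the function names, the 10 ms frame shift, the three-class encoding of the joint task, and the cap on the maximum count are all assumptions made for the example.

```python
from typing import List, Tuple


def count_speakers(segments: List[Tuple[float, float]],
                   num_frames: int,
                   frame_shift: float = 0.01) -> List[int]:
    """Per-frame number of active speakers.

    `segments` holds (start_s, end_s) pairs, one per single-speaker utterance.
    The 10 ms frame shift is an assumed value, not taken from the paper.
    """
    counts = [0] * num_frames
    for start, end in segments:
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for t in range(first, last):
            counts[t] += 1
    return counts


def vad_labels(counts: List[int]) -> List[int]:
    """VAD: 1 when at least one speaker is active."""
    return [int(c >= 1) for c in counts]


def osd_labels(counts: List[int]) -> List[int]:
    """OSD: 1 when at least two speakers are active."""
    return [int(c >= 2) for c in counts]


def joint_labels(counts: List[int]) -> List[int]:
    """Joint VAD+OSD (assumed encoding): 0 = no speech, 1 = one speaker, 2 = overlap."""
    return [min(c, 2) for c in counts]


def counting_labels(counts: List[int], max_speakers: int = 4) -> List[int]:
    """Speaker counting as classification; the cap `max_speakers` is hypothetical."""
    return [min(c, max_speakers) for c in counts]


if __name__ == "__main__":
    # Two utterances that overlap between 30 ms and 50 ms.
    counts = count_speakers([(0.0, 0.05), (0.03, 0.08)], num_frames=10)
    print("count:", counts)
    print("VAD:  ", vad_labels(counts))
    print("OSD:  ", osd_labels(counts))
    print("joint:", joint_labels(counts))
```

Under these assumptions, the counting labels carry strictly more information than the VAD, OSD, or joint labels, since each of those can be recovered from the count by thresholding.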
