Detecting and Counting Overlapping Speakers in Distant Speech Scenarios

We consider the problem of detecting the activity and counting overlapping speakers in distant-microphone recordings. We treat supervised Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), joint VAD+OSD, and speaker counting as instances of a general Overlapped Speech Detection and Counting (OSDC) task, and we design a Temporal Convolu-tional Network (TCN) based method to address it. We show that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets. In particular our best architecture obtains, for OSD, 29.1% and 25.5% absolute improvement in Average Precision over previous techniques on, respectively, the AMI and CHiME-6 datasets. Furthermore, we find that generalization for joint VAD+OSD improves by using a speaker counting objective rather than a VAD+OSD objective. We also study the effectiveness of forced alignment based labeling and data augmentation, and show that both can improve OSD performance.

[1]  Gerald Friedland,et al.  Overlapped speech detection for improved speaker diarization in multiparty meetings , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2]  Valentin Andrei,et al.  Detecting Overlapped Speech on Short Timeframes Using Deep Learning , 2017, INTERSPEECH.

[3]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[4]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[5]  Emanuel A. P. Habets,et al.  Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Horia Cucu,et al.  Overlapped Speech Detection and Competing Speaker Counting–‐Humans Versus Deep Learning , 2019, IEEE Journal of Selected Topics in Signal Processing.

[7]  Bernd Edler,et al.  CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Javier Ramírez,et al.  Efficient voice activity detection algorithms using long-term speech information , 2004, Speech Commun..

[9]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Leibny Paola García-Perera,et al.  Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Mari Ostendorf,et al.  Efficient use of overlap information in speaker diarization , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[12]  Gerald Friedland,et al.  Two's a crowd: improving speaker diarization by automatically identifying and excluding overlapped speech , 2008, INTERSPEECH.

[13]  Morgan Sonderegger,et al.  Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi , 2017, INTERSPEECH.

[14]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[16]  Björn W. Schuller,et al.  Detecting overlapping speech with long short-term memory recurrent neural networks , 2013, INTERSPEECH.

[17]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[18]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Jon Barker,et al.  CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).

[20]  Neville Ryant,et al.  Leveraging LSTM Models for Overlap Detection in Multi-Party Meetings , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Florian Metze,et al.  New Era for Robust Speech Recognition , 2017, Springer International Publishing.

[22]  Vladlen Koltun,et al.  An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling , 2018, ArXiv.

[23]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[24]  Liyuan Liu,et al.  On the Variance of the Adaptive Learning Rate and Beyond , 2019, ICLR.

[25]  Yifan Gong,et al.  Robust automatic speech recognition : a bridge to practical application , 2015 .

[26]  Zhou Yu,et al.  Enhancement and Analysis of Conversational Speech: JSALT 2017 , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Xin Wang,et al.  Speaker detection in the wild: Lessons learned from JSALT 2019 , 2019, Odyssey.

[28]  Emmanuel Vincent,et al.  Audio Source Separation and Speech Enhancement , 2018 .

[29]  Heiga Zen,et al.  Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques , 2019, IEEE Signal Processing Magazine.

[30]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[31]  Marie Kunesová,et al.  Detection of Overlapping Speech for the Purposes of Speaker Diarization , 2019, SPECOM.

[32]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[33]  Jean Carletta,et al.  The AMI meeting corpus , 2005 .

[34]  Mireia Díez,et al.  BUT System for DIHARD Speech Diarization Challenge 2018 , 2018, INTERSPEECH.

[35]  Antonio Miguel,et al.  gpuRIR: A python library for room impulse response simulation with GPU acceleration , 2018, Multimedia Tools and Applications.