Block-Based High Performance CNN Architectures for Frame-Level Overlapping Speech Detection

Speech technology systems such as Automatic Speech Recognition (ASR), speaker diarization, speaker recognition, and speech synthesis have advanced significantly with the emergence of deep learning techniques. However, none of these voice-enabled systems performs well in natural environments, specifically in situations where one or more potential interfering talkers are present. Overlapping speech detection has therefore become an important front-end triage step for speech technology applications. It is crucial for large-scale datasets where manual labeling is not possible. A block-based CNN architecture is proposed to model overlapping speech in audio streams with frames as short as 25 ms. The proposed architecture is robust to both (i) shifts in the distribution of network activations due to changes in network parameters during training, and (ii) local variations in the input features caused by feature extraction, environmental noise, or room interference. We also investigate the effect of alternative input features, including spectral magnitude, MFCC, MFB, and pyknogram, on both computational time and classification performance. Evaluation is performed on simulated overlapping speech signals based on the GRID corpus. The experimental results show that the proposed system detects overlapping speech frames with 90.5% accuracy, 93.5% precision, 92.7% recall, and 92.8% F-score on same-gender overlapped speech. For opposite-gender cases, the network scores exceed 95% on all classification metrics.
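As a reading aid for the four reported metrics, the following is a minimal sketch of how frame-level accuracy, precision, recall, and F-score can be computed from per-frame labels. It is not the authors' evaluation code: the function name `frame_metrics`, the label convention (1 = overlapped frame, 0 = non-overlapped frame), and the toy label sequences are assumptions made for illustration only.

```python
from typing import Dict, Sequence


def frame_metrics(reference: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute frame-level classification metrics.

    Assumed (hypothetical) label convention: 1 = overlapped speech frame,
    0 = non-overlapped frame. Each element corresponds to one short frame.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    # Confusion-matrix counts, treating "overlapped" as the positive class.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    tn = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 0)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy example with eight frames (illustrative values, not data from the paper).
reference_labels = [1, 1, 1, 0, 0, 1, 0, 1]
predicted_labels = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = frame_metrics(reference_labels, predicted_labels)
```

In this toy case there are four true positives, two true negatives, one false positive, and one false negative. Whether the reported figures are computed over all frames at once or averaged across test conditions is not stated in the abstract, so the sketch makes no assumption about that.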
