Frame-Based Overlapping Speech Detection Using Convolutional Neural Networks

Naturalistic speech recordings usually contain speech signals from multiple speakers. Such overlap can degrade the performance of speech technologies because it complicates tracing and recognizing individual speakers. In this study, we investigate the detection of overlapping speech on segments as short as 25 ms using Convolutional Neural Networks. We evaluate the detection performance using different spectral features, and show that pyknogram features outperform other commonly used speech features. The proposed system can predict overlapping speech with an accuracy of 84% and an F-score of 88% on a dataset of mixed speech generated from the GRID dataset.
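
As a rough illustration of what frame-level evaluation involves, the sketch below splits a signal into 25 ms frames and computes accuracy and F-score from per-frame binary labels. It is not the authors' pipeline: the 16 kHz sample rate, the non-overlapping framing, the choice of the F1 variant of the F-score, and the function names `frame_signal` and `accuracy_and_f_score` are all assumptions made for the example.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be positive")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def accuracy_and_f_score(y_true, y_pred):
    """Return (accuracy, F1) for binary frame labels.

    Label 1 means overlapping speech, label 0 means a single speaker.
    Assumption: the F-score is the balanced F1 measure.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f_score


if __name__ == "__main__":
    # Hypothetical 16 kHz signal: 25 ms corresponds to 400 samples per frame.
    frames = frame_signal([0.0] * 1000, sample_rate=16000)
    print(len(frames), len(frames[0]))

    # Toy labels, not results from the paper.
    acc, f1 = accuracy_and_f_score([1, 1, 0, 0], [1, 0, 0, 0])
    print(acc, f1)
```

Reporting both metrics matters when the two classes are imbalanced, since accuracy alone can look favorable for a detector that rarely predicts the minority class.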
