Detecting Overlapped Speech on Short Timeframes Using Deep Learning

This work demonstrates that deep learning techniques can detect overlapped speech on independent short timeframes. A secondary objective is to show how the duration of the signal frame influences the accuracy of the method. We trained a deep neural network with heterogeneous layers and obtained close to 80% inference accuracy on frames as short as 25 milliseconds. The proposed system provides higher detection quality than existing work and can detect overlapped speech with up to three simultaneous speakers. The method has low response latency and does not require much computing power.
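To make the notion of independent short timeframes concrete, the following is a minimal sketch of cutting a mono signal into non-overlapping 25 ms frames, each of which could be classified on its own. The function name `split_into_frames`, the 16 kHz sample rate, and the choice of non-overlapping frames are assumptions made for illustration; the abstract does not specify the preprocessing actually used.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=25):
    """Cut a mono signal into non-overlapping frames of frame_ms milliseconds.

    Hypothetical helper: the 16 kHz default and the lack of frame overlap
    are illustrative assumptions, not details taken from the paper.
    Trailing samples that do not fill a whole frame are dropped.
    """
    # Number of samples in one frame, e.g. 16000 * 25 / 1000 = 400.
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    num_frames = len(samples) // frame_len
    return [
        samples[i * frame_len:(i + 1) * frame_len]
        for i in range(num_frames)
    ]


# Example: 1000 samples at 16 kHz yield two complete 25 ms frames.
signal = list(range(1000))
frames = split_into_frames(signal)
print(len(frames), len(frames[0]))
```

Because each frame is treated independently, the latency of a detector built this way is bounded by the frame duration plus the inference time for a single frame, which is consistent with the low response latency the abstract emphasizes.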
