Assessing Speaker Engagement in 2-Person Debates: Overlap Detection in United States Presidential Debates

Co-channel speech recordings typically contain significant amounts of overlap, in which the intelligibility and quality of the desired speech are degraded by interference from a competing talker. Convolutive Non-negative Matrix Factorization (CNMF) has been shown to be a successful approach to detecting overlap by extracting speaker-specific acoustic basis dimensions from an audio stream. Although CNMF has produced successful results, it requires isolated single-speaker speech recordings for each speaker in order to derive the corresponding basis functions/dimensions. In our previous work, we introduced the Teager-Kaiser Energy Operator (TEO)-based Pyknogram, which does not require prior information about the speakers. In this study, Pyknogram- and CNMF-based solutions for overlap detection within audio streams are examined using the GRID dataset. The TEO-based Pyknogram is shown to achieve a relative 8-10% lower Equal Error Rate (EER) than CNMF features. Another drawback of CNMF is that its performance drops considerably on spontaneous speech that was not used to extract the bases in the training step. In addition to the experiments on the GRID corpus, a secondary evaluation is performed on naturalistic audio streams with overlap. Specifically, we collected a real-world audio database of US Presidential debates from the last 12 years, which are challenging due to overlap, changing Signal-to-Interference Ratio (SIR), and environmental noise. Our experiments indicate that the TEO-based Pyknogram is well suited for detecting overlap in challenging real-world scenarios such as the US Presidential debates.
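
For readers unfamiliar with the operator underlying the Pyknogram, the following is a minimal sketch of the standard discrete-time Teager-Kaiser Energy Operator, psi[x(n)] = x(n)^2 - x(n-1) x(n+1). It illustrates the operator only; the Pyknogram construction itself is not described in this abstract and is not reproduced here. The function name `teager_kaiser_energy` and the test-signal parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns an array two samples shorter than the input, because the
    first and last samples have no two-sided neighbourhood.
    (Hypothetical helper name, for illustration only.)
    """
    x = np.asarray(x, dtype=float)
    if x.ndim != 1 or x.size < 3:
        raise ValueError("x must be a 1-D signal with at least 3 samples")
    return x[1:-1] ** 2 - x[:-2] * x[2:]


# Illustrative test signal (assumed values): a single sinusoid
# x[n] = A * cos(omega * n + phi). For such a signal the operator
# output is the constant A^2 * sin(omega)^2.
amplitude, omega, phase = 2.0, 0.3, 0.7
n = np.arange(200)
tone = amplitude * np.cos(omega * n + phase)
psi = teager_kaiser_energy(tone)
expected = amplitude ** 2 * np.sin(omega) ** 2
```

For a single sinusoid the output is constant, so it tracks both amplitude and frequency of one component. How the paper uses this property across frequency bands to flag overlap is not stated in the abstract.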
