Teager–Kaiser Energy Operators for Overlapped Speech Detection

Overlapped speech refers to a monophonic audio signal in which at least two speakers are active at the same time. This study focuses on distinguishing overlapped speech from single-speaker speech, i.e., overlapped speech detection. We develop an overlap detection algorithm that uses an enhanced time-frequency representation, called a Pyknogram, estimated directly from the input audio signal. Pyknograms use the Teager–Kaiser energy operator to detect resonant time-frequency units and thereby suppress nonharmonic structures. We show that the resulting Pyknograms provide high separability for detecting the presence of interfering speech. Our proposed unsupervised Pyknogram-based detection yields over 30% relative improvement in overlap detection error rates across different signal-to-interference ratios (SIR) compared to baseline systems. In addition, we present a case study that evaluates speaker verification performance under different overlap conditions using the GRID database, and observe that speaker verification equal error rates (EER) vary from 2% to 30%, depending on the average SIR values introduced into the training and test sets. To estimate the reliability of speaker verification scores across different trials, overlap detection results are interpreted as low-level information and stacked alongside verification outputs. The resulting high-dimensional space is passed through a support vector machine classifier to find the separating hyperplane between target and imposter scores. Combining overlap detection scores with speaker verification yields, on average, a 20% relative decrease in EER. We also provide an upper bound for this approach using existing overlap labels, which yields a 23% relative improvement.
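
To make the operator named in the abstract concrete, the following is a minimal sketch of the standard discrete Teager–Kaiser energy operator, psi[n] = x[n]^2 - x[n-1]*x[n+1]. It is not the Pyknogram pipeline described above, which the abstract does not specify in detail; the function name `teager_kaiser_energy`, the 8 kHz sampling rate, and the 200 Hz test tone are assumptions chosen purely for illustration.

```python
import math


def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy for the interior samples of x.

    psi[n] = x[n]^2 - x[n-1] * x[n+1], so the output is two samples
    shorter than the input (the first and last samples have no neighbours).
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Illustrative test tone (all values are assumptions, not from the paper).
sample_rate = 8000.0   # Hz, assumed
frequency = 200.0      # Hz, assumed
amplitude = 0.5        # assumed
omega = 2.0 * math.pi * frequency / sample_rate  # digital frequency, rad/sample

tone = [amplitude * math.cos(omega * n + 0.3) for n in range(64)]
energy = teager_kaiser_energy(tone)

# For a pure sinusoid the operator output is constant: A^2 * sin^2(omega).
expected = (amplitude * math.sin(omega)) ** 2
```

For a single sinusoid the output is constant and depends on both amplitude and frequency, which is why the operator is commonly used to track resonances in band-passed speech. How its output is turned into a Pyknogram is beyond what this sketch covers.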
