
Overlapped Speech Detection in Meeting Using Cross-Channel Spectral Subtraction and Spectrum Similarity

We propose an overlapped speech detection method for speech recognition and speaker diarization of meetings in which each speaker wears a lapel microphone. Two novel features are used as inputs to a GMM-based detector. The first is the speech power after cross-channel spectral subtraction, which reduces the power contributed by the other speakers. The second is the cosine correlation coefficient of amplitude spectra, which effectively captures the correlation between spectral components under relatively quiet conditions. We evaluated our method on a meeting speech corpus of four speakers. Our proposed method achieved an accuracy of 74.1%, significantly better than the 67.0% of the conventional method, which uses raw speech power and the Pearson correlation coefficient of power spectra.

Koichi Shinoda | Koji Iwano | Yu Nasu | Ryo Yokoyama
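The two features and the GMM-based detector described in the abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the crosstalk model (a single hypothetical gain alpha applied to the summed power of the other channels, with a spectral floor), the function names, the GMM parameter layout, and the two-class log-likelihood-ratio decision are all assumptions made for illustration. Spectra are given as per-frame lists of frequency-bin values, so no STFT step is shown.

```python
import math
from typing import Sequence, Tuple

# Hypothetical GMM parameter layout: (weights, means, variances), diagonal covariances.
GMM = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def cross_channel_ss_power(power_spectra: Sequence[Sequence[float]], target: int,
                           alpha: float = 1.0, floor: float = 0.01) -> float:
    """Frame power of channel `target` after cross-channel spectral subtraction.

    Assumption: the crosstalk in each bin is modelled as `alpha` times the summed
    power of all other channels in that bin (a simplified, hypothetical gain model).
    Results are floored at `floor` times the original power, as in classic
    spectral subtraction, so the output never goes negative.
    """
    own = power_spectra[target]
    total = 0.0
    for k, p in enumerate(own):
        leak = sum(ch[k] for i, ch in enumerate(power_spectra) if i != target)
        total += max(p - alpha * leak, floor * p)
    return total


def cosine_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two spectra (e.g. amplitude spectra)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for an all-zero spectrum; treat as uncorrelated
    return dot / (norm_a * norm_b)


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson coefficient: the cosine of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_correlation([x - mean_a for x in a], [y - mean_b for y in b])


def gmm_log_likelihood(x, weights, means, variances) -> float:
    """log p(x) under a diagonal-covariance GMM, using log-sum-exp for stability."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comps.append(ll)
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))


def is_overlap(x: Sequence[float], overlap_gmm: GMM, single_gmm: GMM,
               threshold: float = 0.0) -> bool:
    """Hypothetical two-class decision: log-likelihood ratio above a threshold."""
    return (gmm_log_likelihood(x, *overlap_gmm)
            - gmm_log_likelihood(x, *single_gmm)) > threshold


if __name__ == "__main__":
    p0 = [4.0, 1.0, 0.0, 9.0]  # toy power spectrum, channel 0
    p1 = [1.0, 1.0, 4.0, 0.0]  # toy power spectrum, channel 1
    amp0 = [math.sqrt(p) for p in p0]
    amp1 = [math.sqrt(p) for p in p1]
    print(cross_channel_ss_power([p0, p1], 0, alpha=0.5))  # 13.0 (raw power is 14.0)
    print(cosine_correlation(amp0, amp1), pearson_correlation(p0, p1))
```

The Pearson coefficient is just the cosine of mean-centred vectors, so in this sketch the two correlation features differ only in the mean removal and in whether amplitude or power spectra are compared. A per-frame feature vector such as [log of the subtracted power, cosine correlation] could then be scored by the two hypothetical GMMs.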
