Cepstral mean based speech source discrimination

This paper presents and compares methods for discriminating between speech from a broadcast audio device (such as a television, radio, or GPS receiver) and live speech in the same acoustic environment. A solution to this discrimination problem applies directly wherever audio from such a device interferes with voice recognition, verification, or transcription tasks. The methods and theory also have potential applications in multimedia and speaker segmentation, as well as in speaker verification. The paper introduces a new use of the cepstral mean as an estimator of the linear time-invariant response of a “speaker”, either broadcast or live, over a relatively long time window. The problem is framed in terms of traditional speaker verification, but with two classes of speakers. The method is tested on five different data sets, and the results are compared across feature sets, training methods, and window lengths.
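The idea of using the cepstral mean as a channel estimate can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a plain real cepstrum rather than whatever feature set the authors use, white noise as a stand-in for speech, and a hypothetical two-tap filter `h` as the "speaker" response. The function name `cepstral_mean` and all parameter values are illustrative. Because convolution with a fixed filter becomes addition in the log spectrum, the filter adds roughly the same offset to every frame's cepstrum, and that offset survives averaging over a long window.

```python
import numpy as np

def cepstral_mean(signal, frame_len=512, hop=256, n_ceps=20):
    """Mean of the per-frame real cepstrum (first n_ceps coefficients)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    ceps = np.empty((n_frames, n_ceps))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        # Small constant guards against log(0) in empty bins.
        log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # Real cepstrum: inverse FFT of the log magnitude spectrum.
        ceps[i] = np.fft.irfft(log_mag)[:n_ceps]
    return ceps.mean(axis=0)

rng = np.random.default_rng(0)
source = rng.standard_normal(64000)      # stand-in for a speech signal
h = np.array([1.0, 0.5])                 # hypothetical LTI "speaker" response
filtered = np.convolve(source, h)[: len(source)]

# Shift in the cepstral mean caused by the filter.
shift = cepstral_mean(filtered) - cepstral_mean(source)

# Reference: real cepstrum of the filter itself.
channel_ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(h, 512))))[:20]

print(np.round(shift[:4], 3))
print(np.round(channel_ceps[:4], 3))
```

In this toy setup the two printed vectors should agree closely, which is the sense in which the long-window cepstral mean tracks the linear time-invariant response. With real speech the source term does not average out as cleanly as white noise does, which is presumably why the window length matters in the experiments.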
