Impact of noise reduction and spectrum estimation on noise robust speaker identification

Many spectrum estimation methods and speech enhancement algorithms have previously been evaluated for noise-robust speaker identification (SID). However, these techniques have mostly been evaluated on artificially noised, mismatched training tasks with GMM-UBM speaker models. It is therefore unclear whether the performance improvements observed with these methods translate to a broader range of noisy SID tasks. This study compares selected spectrum estimation methods from three classes (cochlear filterbanks, alternative time-domain windowing, and linear prediction-based techniques), as well as a set of frequency-domain noise reduction algorithms, across a suite of 8 evaluation tasks. The evaluation tasks are designed to expand upon the limited tasks addressed in past evaluations by exploring three research questions: performance on real noise versus artificial noise, performance on matched training tasks versus mismatched tasks, and performance when paired with an i-vector backend versus a GMM-UBM backend. We find that noise-robust spectrum estimation methods can improve the performance of SID systems over the range of noise tasks evaluated, including real noisy tasks, matched training tasks, and i-vector backends. However, performance on the typical case (GMM-UBM backend, mismatched training, artificial noise) did not predict performance on other tasks. Finally, the matched enrollment case is a significantly different problem from the mismatched enrollment case.

Index Terms: mismatched condition, noise robustness, robust features, speaker identification, speech enhancement
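The evaluation design described above can be pictured as a grid over the three contrasts. The sketch below is illustrative only: it assumes that the 8 tasks are the full cross of the three binary contrasts (2 x 2 x 2), which the abstract suggests but does not state, and the names `FACTORS`, `enumerate_tasks`, and `baseline` are hypothetical labels rather than anything defined in the paper.

```python
from itertools import product

# Hypothetical factor names and levels. The abstract names these three
# contrasts; treating the 8 tasks as their full cross is an assumption.
FACTORS = {
    "noise": ("artificial", "real"),
    "training": ("mismatched", "matched"),
    "backend": ("GMM-UBM", "i-vector"),
}


def enumerate_tasks(factors=FACTORS):
    """Return one dict per combination of factor levels."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


tasks = enumerate_tasks()

# The condition the abstract calls typical of past evaluations.
baseline = {
    "noise": "artificial",
    "training": "mismatched",
    "backend": "GMM-UBM",
}

if __name__ == "__main__":
    for task in tasks:
        marker = " (typical past setup)" if task == baseline else ""
        print(task, marker)
```

Under this assumption, the condition typical of past evaluations is a single cell of the grid, which is one way to read the finding that results on that cell did not predict results on the others.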
