Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification

In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension to the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) and averages the resulting spectra in the frequency domain. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model (GMM) based classifiers: one with a universal background model (GMM-UBM), one with a support vector machine (GMM-SVM), and one with joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs.
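As an illustration of the idea of multiple tapers with frequency-domain averaging, the following minimal sketch computes a multitaper power spectrum for one frame and a single-window Hamming baseline for comparison. It assumes sine tapers with uniform weights, which is only one possible taper family and weighting and not necessarily the configuration evaluated in the paper. The function names (`sine_tapers`, `multitaper_spectrum`, `hamming_spectrum`) are hypothetical, and the frame length and taper count in the usage lines are arbitrary.

```python
import numpy as np


def sine_tapers(frame_len, num_tapers):
    """Return orthonormal sine tapers, shape (num_tapers, frame_len).

    Assumed taper family for this sketch; other families could be swapped in.
    """
    n = np.arange(1, frame_len + 1)
    j = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * j * n / (frame_len + 1))


def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted average of the power spectra obtained with each taper."""
    if weights is None:
        # Uniform weights are an assumption made here for simplicity.
        weights = np.full(tapers.shape[0], 1.0 / tapers.shape[0])
    # One windowed DFT per taper: rows are the individual spectra.
    eigenspectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ eigenspectra


def hamming_spectrum(frame):
    """Baseline: power spectrum from a single unit-energy Hamming window."""
    window = np.hamming(len(frame))
    window = window / np.sqrt(np.sum(window ** 2))
    return np.abs(np.fft.rfft(window * frame)) ** 2


# Usage on one synthetic white-noise frame (arbitrary sizes).
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
spectrum = multitaper_spectrum(frame, sine_tapers(256, 6))
```

In an MFCC front end, the multitaper spectrum would take the place of the windowed-DFT power spectrum, with the mel filterbank, logarithm, and discrete cosine transform applied afterwards as in the conventional pipeline.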
