Robust speaker recognition using spectro-temporal autoregressive models

Speaker recognition in noisy environments is challenging when the data used for enrollment and verification are mismatched. In this paper, we propose a robust feature extraction scheme based on spectro-temporal modulation filtering using two-dimensional (2-D) autoregressive (AR) models. The first step is AR modeling of the sub-band temporal envelopes, obtained by applying linear prediction to the sub-band discrete cosine transform (DCT) components. These sub-band envelopes are stacked together and used for a second AR modeling step, which approximates the spectral envelope across the sub-bands; cepstral features derived from this model are used for speaker recognition. The AR models emphasize the high-energy regions of the signal, which are relatively well preserved in the presence of noise. The degree of modulation filtering is controlled by the AR model order. Experiments are performed using noisy versions of the NIST 2010 speaker recognition evaluation (SRE) data with a state-of-the-art speaker recognition system. In these experiments, the proposed features provide significant improvements over baseline features (relative improvements of 20% in equal error rate (EER) and 35% in miss rate at 10% false alarm).
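The two AR modeling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform rectangular sub-bands, the function names, the model orders, and the choice to treat each envelope sample as one frame (instead of integrating over short windows) are all assumptions made for brevity.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz


def autocorr(seq, max_lag):
    """Autocorrelation of a sequence for lags 0..max_lag."""
    return np.array([seq[:len(seq) - k] @ seq[k:] for k in range(max_lag + 1)])


def ar_from_autocorr(r, order):
    """Solve the Yule-Walker equations; returns (a, gain) with a[0] == 1."""
    r = np.array(r[:order + 1], dtype=float)
    r[0] = r[0] * (1.0 + 1e-6) + 1e-12  # light regularisation (assumption)
    coeffs = solve_toeplitz(r[:order], -r[1:])
    a = np.concatenate(([1.0], coeffs))
    gain = r[0] + coeffs @ r[1:]  # prediction error energy
    return a, gain


def subband_envelopes(x, n_bands, order_t, n_points):
    """Step 1: linear prediction on sub-band DCT components.

    The AR 'spectrum' of each DCT segment is read as a smoothed temporal
    envelope of that sub-band, sampled at n_points instants over the signal.
    Uniform rectangular bands are an assumption of this sketch.
    """
    c = dct(x, type=2, norm="ortho")
    edges = np.linspace(0, len(c), n_bands + 1).astype(int)
    env = np.empty((n_bands, n_points))
    for b in range(n_bands):
        band = c[edges[b]:edges[b + 1]]
        a, gain = ar_from_autocorr(autocorr(band, order_t), order_t)
        env[b] = gain / np.abs(np.fft.rfft(a, 2 * n_points)[:n_points]) ** 2
    return env


def ar_to_cepstrum(a, n_ceps):
    """Standard recursion from AR coefficients (a[0] == 1) to cepstra c_1..c_n."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        c[n] = (-a[n] if n <= p else 0.0) - acc / n
    return c[1:]


def spectral_ar_cepstra(env, order_f, n_ceps):
    """Step 2: AR model across the stacked sub-bands, one frame per column."""
    feats = np.empty((env.shape[1], n_ceps))
    for t in range(env.shape[1]):
        # Inverse Fourier transform of the power values across sub-bands
        # gives an autocorrelation sequence for the spectral AR model.
        r = np.fft.irfft(env[:, t])
        a, _ = ar_from_autocorr(r, order_f)
        feats[t] = ar_to_cepstrum(a, n_ceps)
    return feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = np.arange(2000)
    # Synthetic test signal (not speech): a tone burst in low-level noise.
    x = 0.05 * rng.standard_normal(n.size)
    x += np.exp(-0.5 * ((n - 1200) / 60.0) ** 2) * np.cos(0.3 * np.pi * n)
    env = subband_envelopes(x, n_bands=8, order_t=8, n_points=200)
    feats = spectral_ar_cepstra(env, order_f=4, n_ceps=6)
    print(env.shape, feats.shape)
```

In this sketch, `order_t` and `order_f` play the role of the AR model order mentioned above: lower orders give smoother envelopes in time and across sub-bands respectively, which corresponds to stronger modulation filtering.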
