Effective use of DCTs for contextualizing features for speaker recognition

This article proposes a new approach to contextualizing features for speaker recognition using the discrete cosine transform (DCT). Specifically, we apply a 2D-DCT to the Mel filterbank outputs to replace the common Mel frequency cepstral coefficients (MFCCs) appended with deltas and double deltas. We provide a thorough comparison of delta-computation algorithms and DCT-based contextualization for speaker recognition, and consider the effect of varying the size of the analysis window in each case. Selecting 2D-DCT coefficients in zig-zag order allows the feature dimension to be set arbitrarily while retaining the highest-energy coefficients. We show that 60 coefficients computed using our approach outperform standard MFCCs appended with deltas and double deltas by up to 25% relative on the NIST 2012 speaker recognition evaluation (SRE) corpus in both Cprimary and equal error rate (EER), while additional coefficients increase system robustness to noise.
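The contextualization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the context of 10 frames on each side, the edge padding, and the 24-band input are all assumptions made for illustration, and the abstract does not specify these settings. The sketch takes a 2D-DCT over a window of log Mel filterbank frames and keeps the first coefficients in JPEG-style zig-zag order.

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is the k-th basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (i + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)  # scale the DC row so the matrix is orthonormal
    return basis

def zigzag_indices(n_rows, n_cols):
    """(row, col) pairs ordered along anti-diagonals, alternating direction."""
    order = []
    for s in range(n_rows + n_cols - 1):
        diag = [(r, s - r) for r in range(n_rows) if 0 <= s - r < n_cols]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def dct2d_features(log_mel, context=10, n_coeffs=60):
    """Hypothetical helper: log_mel is (n_frames, n_bands); returns (n_frames, n_coeffs).

    context and n_coeffs are illustrative defaults, not values confirmed by the source
    (apart from the 60-coefficient dimension mentioned in the abstract).
    """
    n_frames, n_bands = log_mel.shape
    width = 2 * context + 1
    if n_coeffs > n_bands * width:
        raise ValueError("n_coeffs exceeds the number of 2D-DCT coefficients")
    d_freq = dct2_matrix(n_bands)
    d_time = dct2_matrix(width)
    idx = zigzag_indices(n_bands, width)[:n_coeffs]
    rows = np.array([r for r, _ in idx])
    cols = np.array([c for _, c in idx])
    # Repeat the first and last frames so every frame has a full window (an assumption).
    padded = np.pad(log_mel, ((context, context), (0, 0)), mode="edge")
    feats = np.empty((n_frames, n_coeffs))
    for t in range(n_frames):
        patch = padded[t:t + width].T          # (n_bands, width): frequency by time
        coeffs = d_freq @ patch @ d_time.T     # separable 2D-DCT
        feats[t] = coeffs[rows, cols]          # zig-zag selection
    return feats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_mel = rng.standard_normal((50, 24))    # stand-in for real log Mel energies
    print(dct2d_features(log_mel).shape)
```

Because low-order coefficients along both axes come first in zig-zag order, truncating the list at any length yields a feature vector of that dimension without changing the transform itself.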
