Multi-resolution time frequency feature and complementary combination for short utterance speaker recognition

A human speaker recognition expert often inspects the speech spectrogram at several different scales, especially when only a short utterance is available. Inspired by this practice, this paper proposes a novel multi-resolution time-frequency (MRTF) feature extraction method. The feature is obtained by applying a 2-dimensional discrete cosine transform (DCT) at multiple scales to the time-frequency spectrogram matrix, then selecting and combining the multi-scale transformed elements to form the final feature. Compared with traditional Mel-Frequency Cepstral Coefficient (MFCC) feature extraction, the proposed method makes better use of multi-resolution time-frequency information. We also propose three strategies for the complementary combination of MFCC and MRTF: at the feature level, at the i-vector level, and at the score level. Comparing their performance, we find that combination at the i-vector level gives the best results. On three NIST 2008 Speaker Recognition Evaluation datasets, the proposed method improves performance more under short-utterance conditions than under long-utterance conditions. After combination, we achieve an EER of 11.32 % and a MinDCF of 0.054 in the 10sec-10sec trials on the male dataset, an absolute EER improvement of 3 % over the best reported result in this field.
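The multi-scale 2-D DCT idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window widths, the number of retained coefficients, the low-order selection rule, and the function names (`dct2_matrix`, `dct_2d`, `mrtf_sketch`) are all assumptions chosen for illustration, and the input is assumed to be a log Mel spectrogram of shape (bands, frames).

```python
import numpy as np


def dct2_matrix(n):
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]   # coefficient index
    i = np.arange(n)[None, :]   # sample index
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] *= np.sqrt(1.0 / n)
    basis[1:, :] *= np.sqrt(2.0 / n)
    return basis


def dct_2d(block):
    """Apply a separable 2-D DCT-II along frequency (rows) and time (columns)."""
    rows, cols = block.shape
    return dct2_matrix(rows) @ block @ dct2_matrix(cols).T


def mrtf_sketch(log_spec, center, widths=(5, 9, 17), keep=(4, 3)):
    """Build one multi-resolution feature vector around frame `center`.

    log_spec : array of shape (n_bands, n_frames), assumed log Mel energies.
    widths   : odd context widths in frames, one per scale (assumed values).
    keep     : low-order (frequency, time) coefficients kept per scale
               (assumed selection rule, not taken from the paper).
    """
    n_frames = log_spec.shape[1]
    parts = []
    for width in widths:
        half = width // 2
        # Clamp indices so windows near the edges repeat the boundary frame.
        idx = np.clip(np.arange(center - half, center + half + 1), 0, n_frames - 1)
        coeffs = dct_2d(log_spec[:, idx])
        parts.append(coeffs[:keep[0], :keep[1]].ravel())
    # Concatenate the selected coefficients from every scale.
    return np.concatenate(parts)
```

With the default settings, each frame yields 3 scales × 4 × 3 = 36 coefficients. Wider windows capture slower temporal modulation while narrower windows keep finer time detail, which is the multi-resolution trade-off the abstract refers to.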
