Linear versus mel frequency cepstral coefficients for speaker recognition

Mel-frequency cepstral coefficients (MFCC) have been the dominant features in speaker recognition as well as in speech recognition. However, theories of speech production indicate that some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high-frequency range of speech. This suggests that a linear frequency scale may offer some advantages over the mel scale in speaker recognition. Using two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear-frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results on SRE10 show that, although the two feature sets are complementary, LFCC consistently outperforms MFCC, mainly because of its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech: LFCC benefits female speech more by better capturing the spectral characteristics in the high-frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. We conclude that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
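
The difference between the two feature sets comes down to how the filterbank is spaced along the frequency axis. The following is a minimal sketch of that idea, not the front-end used in the study: the 8 kHz sampling rate, 24 filters, 13 coefficients, the common 2595 log10(1 + f/700) mel formula, and all function names are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; several variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def filter_edges(n_filters, f_max, scale):
    """Return n_filters + 2 band-edge frequencies in Hz, from 0 to f_max."""
    if scale == "linear":
        # LFCC: edges equally spaced in Hz.
        return np.linspace(0.0, f_max, n_filters + 2)
    if scale == "mel":
        # MFCC: edges equally spaced in mel, hence denser at low frequencies.
        return mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), n_filters + 2))
    raise ValueError("scale must be 'linear' or 'mel'")

def triangular_filterbank(n_filters, n_fft, sample_rate, scale):
    """Build a (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    edges = filter_edges(n_filters, sample_rate / 2.0, scale)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def cepstral_coefficients(frame, sample_rate, scale, n_filters=24, n_ceps=13):
    """Cepstral coefficients of one frame using the chosen frequency scale."""
    n_fft = frame.size
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    bank = triangular_filterbank(n_filters, n_fft, sample_rate, scale)
    log_energy = np.log(bank @ power + 1e-10)  # small floor avoids log(0)
    # Unnormalized DCT-II over the log filterbank energies.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_energy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(256)  # stand-in for one 32 ms frame at 8 kHz
    lfcc = cepstral_coefficients(frame, 8000, "linear")
    mfcc = cepstral_coefficients(frame, 8000, "mel")
    print(lfcc.shape, mfcc.shape)
```

With the same number of filters, the linear spacing places more filter centers in the upper half of the band than the mel spacing does, which is the property the abstract credits for the better modeling of high-frequency, vocal-tract-related cues.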
