On the Use of Complementary Spectral Features for Speaker Recognition

The most popular features for speaker recognition are Mel frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs). These features are used extensively because they characterize the vocal tract configuration which is known to be highly speaker-dependent. In this work, several features are introduced that can characterize the vocal system in order to complement the traditional features and produce better speaker recognition models. The spectral centroid (SC), spectral bandwidth (SBW), spectral band energy (SBE), spectral crest factor (SCF), spectral flatness measure (SFM), Shannon entropy (SE), and Renyi entropy (RE) were utilized for this purpose. This work demonstrates that these features are robust in noisy conditions by simulating some common distortions that are found in the speakers' environment and a typical telephone channel. Babble noise, additive white Gaussian noise (AWGN), and a bandpass channel with 1 dB of ripple were used to simulate these noisy conditions. The results show significant improvements in classification performance for all noise conditions when these features were used to complement the MFCC and MFCC features. In particular, the SC and SCF improved performance in almost all noise conditions within the examined SNR range (10–40 dB). For example, in cases where there was only one source of distortion, classification improvements of up to 8% and 10% were achieved under babble noise and AWGN, respectively, using the SCF feature.

[1]  Kuldip K. Paliwal,et al.  Spectral subband centroid features for speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[2]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[3]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[4]  Douglas A. Reynolds,et al.  Modeling of the glottal flow derivative waveform with application to speaker identification , 1999, IEEE Trans. Speech Audio Process..

[5]  Douglas D. O'Shaughnessy,et al.  Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition , 1999, IEEE Trans. Speech Audio Process..

[6]  J. Ortega-Garcia,et al.  Biometric identification through speaker verification over telephone lines , 1999, Proceedings IEEE 33rd Annual 1999 International Carnahan Conference on Security Technology (Cat. No.99CH36303).

[7]  D.G. Childers,et al.  Measuring and modeling vocal source-tract interaction , 1994, IEEE Transactions on Biomedical Engineering.

[8]  Jonathan G. Fiscus,et al.  DARPA TIMIT:: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1 , 1993 .

[9]  William J. Williams,et al.  Information bounds for random signals in time-frequency plane , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[10]  J.M. Naik,et al.  Speaker verification: a tutorial , 1990, IEEE Communications Magazine.

[11]  Hua Ouyang,et al.  Use of Vocal Source Features in Speaker Segmentation , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[12]  Alvin F. Martin,et al.  The NIST 1999 Speaker Recognition Evaluation - An Overview , 2000, Digit. Signal Process..

[13]  Nengheng Zheng,et al.  Using Haar transformed vocal source information for automatic speaker recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  M. Faundez-Zanuy,et al.  State-of-the-art in speaker recognition , 2005, IEEE Aerospace and Electronic Systems Magazine.

[15]  A. C. Kot,et al.  Speaker verification for home security system , 1997, ISCE '97. Proceedings of 1997 IEEE International Symposium on Consumer Electronics (Cat. No.97TH8348).

[16]  Andrew Beng Jin Teoh,et al.  An Internet based speech biometric verification system , 2003, 9th Asia-Pacific Conference on Communications (IEEE Cat. No.03EX732).

[17]  Danoush Hosseinzadeh,et al.  Combining Vocal Source and MFCC Features for Enhanced Speaker Recognition Performance Using GMMs , 2007, 2007 IEEE 9th Workshop on Multimedia Signal Processing.

[18]  Roland Auckenthaler,et al.  Improving a GMM speaker verification system by phonetic weighting , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[19]  Jirí Navrátil,et al.  Depitch and the role of fundamental frequency in speaker recognition , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[20]  Olivier J. J. Michel,et al.  Time-frequency complexity and information , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[21]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[22]  Jr. J.P. Campbell,et al.  Speaker recognition: a tutorial , 1997, Proc. IEEE.

[23]  Douglas A. Reynolds,et al.  Robust text-independent speaker identification using Gaussian mixture speaker models , 1995, IEEE Trans. Speech Audio Process..