Speaker Identification and Verification by Combining MFCC and Phase Information

In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the change variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods. There is a problem with the original phase information extraction method when comparing two phase values. For example, the difference in the two values of π-\mathtildeθ1 and \mathtildeθ2=-π+\mathtildeθ1 is 2π-2\mathtildeθ1 . If \mathtildeθ1 ≈ 0, then the difference ≈ 2π, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers with normal, fast and slow speaking modes during five sessions. Although the phase information-based method performs worse than the MFCC-based method, it augments the MFCC and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved to 98.8% from 97.4% (MFCC), and equal error rate for speaker verification was reduced to 0.45% from 0.72% (MFCC), respectively. We also conducted the speaker identification and verification experiments on a large-scale Japanese Newspaper Article Sentences (JNAS) database, a similar trend as NTT database was obtained.

[1]  Sadaoki Furui,et al.  Concatenated phoneme models for text-variable speaker recognition , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Longbiao Wang,et al.  Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM , 2007, Speech Commun..

[3]  Victor Zue,et al.  Speech database development at MIT: Timit and beyond , 1990, Speech Commun..

[4]  Seiichi Nakagawa,et al.  An unsupervised speaker adaptation method for continuous parameter HMM by maximum a posteriori probability estimation , 1994, ICSLP.

[5]  Wei Zhang,et al.  Text-independent speaker recognition by combining speaker-specific GMM with speaker adapted syllable-based HMM , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  Eliathamby Ambikairajah,et al.  LS regularization of group delay features for speaker recognition , 2009, INTERSPEECH.

[7]  A.E. Rosenberg,et al.  Automatic speaker verification: A review , 1976, Proceedings of the IEEE.

[8]  Longbiao Wang,et al.  Speaker identification by combining MFCC and phase information in noisy environments , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[10]  Chiyomi Miyajima,et al.  Speaker identification using Gaussian mixture models based on multi-space probability distribution , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[11]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[12]  ZhangWei,et al.  Text-Independent/Text-Prompted Speaker Recognition by Combining Speaker-Specific GMM with Speaker Adapted Syllable-Based HMM , 2006 .

[13]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[14]  Shuichi Itahashi,et al.  JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research , 1999 .

[15]  Longbiao Wang,et al.  Speaker recognition by combining MFCC and phase information , 2010, INTERSPEECH.

[16]  Hermann Ney,et al.  Using phase spectrum information for improved speech recognition performance , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[17]  Parham Aarabi,et al.  On the importance of phase in human speech recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Tomoko Matsui,et al.  Comparative Study of Speaker Identification Methods: dPLRM, SVM and GMM , 2006, IEICE Trans. Inf. Syst..

[19]  Wei Zhang,et al.  Text-Independent/Text-Prompted Speaker Recognition by Combining Speaker-Specific GMM with Speaker Adapted Syllable-Based HMM , 2006, IEICE Trans. Inf. Syst..

[20]  Yoshihiko Nankaku,et al.  Parameter Sharing in Mixture of Factor Analyzers for Speaker Identification , 2005, IEICE Trans. Inf. Syst..

[21]  Jr. J.P. Campbell,et al.  Speaker recognition: a tutorial , 1997, Proc. IEEE.

[22]  Nengheng Zheng,et al.  Integration of Complementary Acoustic Features for Speaker Recognition , 2007, IEEE Signal Processing Letters.

[23]  Biing-Hwang Juang,et al.  The use of cohort normalized scores for speaker verification , 1992, ICSLP.

[24]  Douglas A. Reynolds,et al.  A Tutorial on Text-Independent Speaker Verification , 2004, EURASIP J. Adv. Signal Process..

[25]  Seiichi Nakagawa,et al.  Text-independent speaker recognition using non-linear frame likelihood transformation , 1998, Speech Commun..

[26]  Douglas A. Reynolds,et al.  Modeling of the glottal flow derivative waveform with application to speaker identification , 1999, IEEE Trans. Speech Audio Process..

[27]  Douglas A. Reynolds,et al.  The NIST speaker recognition evaluation - Overview, methodology, systems, results, perspective , 2000, Speech Commun..

[28]  Alvin F. Martin,et al.  The NIST speaker recognition evaluation program , 2005 .

[29]  Günther Palm,et al.  Effects of phase on the perception of intervocalic stop consonants , 1997, Speech Commun..

[30]  Rajesh M. Hegde,et al.  Application of the modified group delay function to speaker identification and discrimination , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[31]  Konstantin Markov,et al.  Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition , 1999 .

[32]  Masafumi Nishida,et al.  Speaker recognition by separating phonetic space and speaker space , 2001, INTERSPEECH.

[33]  Kuldip K. Paliwal,et al.  Usefulness of phase spectrum in human speech perception , 2003, INTERSPEECH.

[34]  Guangji Shi,et al.  Phased-Based Speech Processing , 2005 .

[35]  Sree Hari Krishnan Parthasarathi,et al.  Robustness of phase based features for speaker recognition , 2009, INTERSPEECH.

[36]  Bayya Yegnanarayana,et al.  Combining evidence from residual phase and MFCC features for speaker recognition , 2006, IEEE Signal Processing Letters.