Speaker Identification Using Pseudo Pitch Synchronized Phase Information in Voiced Sound

In conventional speaker identification methods based on mel-frequency cepstral coefficients (MFCCs), phase information is ignored. Our recent studies have shown that phase information contains speaker-dependent characteristics. We propose a new method that extracts pitch-synchronous phase information from voiced sections only. Speaker identification experiments were performed using the NTT clean database and the JNAS database. With the new phase extraction method, we obtained relative reductions in the speaker error rate of approximately 27% and 46%, respectively, for the two databases. We also obtained relative error reductions of approximately 52% and 42%, respectively, when combining the phase information with the MFCC-based method.

I. INTRODUCTION

In conventional speaker identification methods based on MFCCs, only the magnitude of the Fourier transform of time-domain speech frames is used; the phase component is ignored. Of course, MFCCs capture not only speaker-specific vocal tract information but also vocal source characteristics. Nevertheless, feature parameters extracted from excitation source characteristics are also useful for speaker identification [1], [4], [5], [6], [7], [10]. Almost all of the existing methods are based on linear predictive coding (LPC) analysis. Markov and Nakagawa proposed a Gaussian mixture model (GMM) based text-independent speaker identification system that integrates pitch and the LPC residual with the LPC-derived cepstral coefficients [4]. Their experimental results show that pitch information is most effective when the correlation between pitch and the cepstral coefficients is taken into consideration. An automatic technique for estimating and modeling the glottal flow derivative source waveform of speech, and applying the model parameters to speaker identification, was proposed in [5].
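To make the magnitude/phase split mentioned at the start of this section concrete, the following NumPy sketch (my illustration, not the authors' code; the 200 Hz tone and frame size are arbitrary) shows that a magnitude-based front end keeps only `np.abs` of the spectrum, and that shifting the analysis window barely changes the magnitude while rotating the phase:

```python
import numpy as np

# One 25 ms frame of a synthetic 200 Hz "voiced" tone at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 200 * t)

spec = np.fft.rfft(frame * np.hamming(len(frame)))
magnitude = np.abs(spec)    # |S(w,t)|: the only part an MFCC front end keeps
phase = np.angle(spec)      # theta(w,t): discarded by magnitude-based features

# The same tone clipped 10 samples later: the magnitude spectrum is nearly
# unchanged, but the phase at the 200 Hz bin rotates with the window position.
shifted = np.sin(2 * np.pi * 200 * (t + 10 / sr))
spec2 = np.fft.rfft(shifted * np.hamming(len(shifted)))
print(abs(np.angle(spec2)[5] - phase[5]))  # clear rotation at the peak bin
```

This window-position sensitivity of the raw phase is exactly the problem the paper's extraction method addresses.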
The complementary nature of speaker-specific information in the residual phase compared with the information in conventional MFCCs was demonstrated in [6], where the residual phase was derived from speech signals by linear prediction analysis. Zheng et al. proposed a speaker verification system using complementary acoustic features derived from the vocal source excitation and the vocal-tract system [7]. A new feature set, called the wavelet octave coefficients of residues (WOCOR), was proposed to capture the spectro-temporal source excitation characteristics embedded in the linear predictive residual signal [7]. Recently, many speaker recognition studies using group-delay-based phase information have been proposed [8], [9]. Wang et al. proposed phase-related features for speaker recognition [11]. This type of phase information considers all frequency ranges.

We think that phase information is valid for speaker identification, since it captures the features of the source wave. Previously, we proposed a speaker identification system using a combination of MFCCs and phase information extracted directly from a limited bandwidth of the Fourier transform of the speech wave [1], [2]. We also showed that the phase information is effective for speaker identification in both clean and noisy environments [1], [2], [3]. However, problems occurred in extracting the phase information because of the influence of the windowing position. Therefore, we propose a new method to extract pitch-synchronous phase information in voiced sound only. Using the new extraction method, the relative speaker identification error rate was reduced by approximately 27% and 46% for the NTT and JNAS databases, respectively.

The rest of this paper is organized as follows. Section 2 presents the phase information extraction method, while Section 3 discusses combining the phase and MFCC methods. The experimental setup and results are reported in Section 4, and Section 5 presents our conclusions.

II. PHASE INFORMATION EXTRACTION

A. Formulas [1], [3]

The spectrum S(ω, t) of a signal is obtained by applying the DFT to an input speech signal sequence:

S(\omega, t) = X(\omega, t) + jY(\omega, t) = \sqrt{X^2(\omega, t) + Y^2(\omega, t)} \times e^{j\theta(\omega, t)}.  (1)

However, the phase changes depending on the clipping position of the input speech, even at the same frequency ω. To overcome this problem, the phase of a certain basis frequency ω is kept constant, and the phases of the other frequencies are estimated relative to it. For example, by setting the phase of the basis frequency ω to π/4, we obtain

S'(\omega, t) = \sqrt{X^2(\omega, t) + Y^2(\omega, t)} \times e^{j\theta(\omega, t)} \times e^{j(\frac{\pi}{4} - \theta(\omega, t))},  (2)

whereas for another frequency ω' = 2πf', the spectrum becomes

S'(\omega', t) = \sqrt{X^2(\omega', t) + Y^2(\omega', t)} \times e^{j\theta(\omega', t)} \times e^{j\frac{\omega'}{\omega}(\frac{\pi}{4} - \theta(\omega, t))} = \tilde{X}(\omega', t) + j\tilde{Y}(\omega', t).  (3)

APSIPA ASC 2011 Xi'an
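The normalization described above (fixing the phase of a basis frequency at π/4 and rotating every other frequency proportionally) can be rendered as a minimal NumPy sketch. This is my illustration, not the authors' implementation; the frame length, basis bin, and test signal are arbitrary:

```python
import numpy as np

def normalize_phase(frame, basis_bin, target=np.pi / 4):
    """Fix the phase of `basis_bin` at `target` and shift every other
    bin w' by (w'/w) times the same correction (a sketch of the idea)."""
    spec = np.fft.fft(frame)              # X(w,t) + jY(w,t)
    theta = np.angle(spec)
    shift = target - theta[basis_bin]     # correction applied at the basis bin
    ratio = np.arange(len(spec)) / basis_bin
    # Each bin is rotated by (w'/w) * (pi/4 - theta(w,t)); magnitudes
    # are untouched, only the phase is re-referenced.
    return spec * np.exp(1j * ratio * shift)

# A sinusoid with an arbitrary starting phase of 0.7 rad at bin 8 of 64.
x = np.sin(2 * np.pi * 8 * np.arange(64) / 64 + 0.7)
s = normalize_phase(x, basis_bin=8)
print(np.angle(s[8]))  # ~0.785 = pi/4, regardless of the 0.7 offset
```

Whatever the clipping position of the input, the basis bin always ends up at π/4, so the remaining phase values carry window-position-independent information.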

[1] N. Wang et al., "Exploitation of phase information for speaker recognition," INTERSPEECH, 2010.

[2] T. Dutoit et al., "On the potential of glottal signatures for speaker recognition," INTERSPEECH, 2010.

[3] L. Wang et al., "Speaker identification by combining MFCC and phase information in noisy environments," ICASSP, 2010.

[4] E. Ambikairajah et al., "LS regularization of group delay features for speaker recognition," INTERSPEECH, 2009.

[5] S. H. K. Parthasarathi et al., "Robustness of phase based features for speaker recognition," INTERSPEECH, 2009.

[6] K. Markov et al., "Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition," 1999.

[7] D. A. Reynolds et al., "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Trans. Speech Audio Process., 1999.

[8] L. Wang et al., "High improvement of speaker identification and verification by combining MFCC and phase information," ICASSP, 2009.

[9] L. Wang et al., "Speaker recognition by combining MFCC and phase information," INTERSPEECH, 2010.

[10] N. Zheng et al., "Integration of Complementary Acoustic Features for Speaker Recognition," IEEE Signal Processing Letters, 2007.

[11] B. Yegnanarayana et al., "Combining evidence from residual phase and MFCC features for speaker recognition," IEEE Signal Processing Letters, 2006.