Relative phase information for detecting human speech and spoofed speech

The detection of human versus spoofed (synthetic or converted) speech has begun to receive increasing attention. In this study, relative phase information extracted from the Fourier spectrum is used to distinguish human speech from spoofed speech. Because the original (natural) phase information is almost entirely lost in speech produced by current synthesis and conversion techniques, a modified group delay based feature, which builds on the frequency derivative of the phase spectrum, has been shown to be effective for this task. However, the modified group delay feature contains both magnitude spectrum and phase information. Relative phase information, which contains only phase information, is therefore expected to achieve better spoofing detection performance. The relative phase information is also combined with Mel-frequency cepstral coefficients (MFCC) and the modified group delay. The proposed method was evaluated on the “ASVspoof 2015: Automatic Speaker Verification Spoofing and Countermeasures Challenge” dataset. The results show that the proposed relative phase information significantly outperforms both MFCC and modified group delay: the equal error rate (EER) was 1.74% for MFCC, 0.83% for modified group delay, and 0.013% for relative phase. Combining the relative phase with MFCC and modified group delay reduced the EER further to 0.002%.

Index Terms: spoofing detection, relative phase information, group delay, GMM, countermeasures
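
To make the phrase "frequency derivative of the phase spectrum" concrete, the following is a minimal sketch of the plain (unmodified) group delay of a single frame. It is not the modified group delay feature or the relative phase feature evaluated in this study; the function name `group_delay`, its parameters, and the use of NumPy are assumptions made for illustration. It relies on the standard identity that the negative phase derivative equals Re(Y/X), where X is the spectrum of x[n] and Y is the spectrum of n·x[n], which avoids explicit phase unwrapping.

```python
import numpy as np

def group_delay(frame, n_fft=None, eps=1e-12):
    """Plain group delay (in samples) of one frame: -d(phase)/d(omega).

    Hypothetical helper for illustration only; not the paper's feature.
    """
    x = np.asarray(frame, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)    # spectrum of n * x[n]
    power = X.real**2 + X.imag**2    # |X|^2, the magnitude dependence
    # Re(Y / X) written out; eps guards against division by zero.
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(power, eps)
```

As a sanity check, a unit impulse delayed by five samples has a group delay of five samples at every frequency bin. The division by |X|^2 also shows why group delay based features are not independent of the magnitude spectrum.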
