Detecting Converted Speech and Natural Speech for anti-Spoofing Attack in Speaker Recognition

Voice conversion techniques present a threat to speaker verification systems. To enhance the security of speaker verification systems, we study how to automatically distinguish natural speech from synthetic/converted speech. Motivated by research on the role of the phase spectrum in speech perception, we propose to use features derived from the phase spectrum to detect converted speech. The features are tested under three training conditions for the converted-speech detector: a) only Gaussian mixture model (GMM) based converted speech data are available; b) only unit-selection based converted speech data are available; c) no converted speech data are available for training the converted-speech model. Experiments conducted on the National Institute of Standards and Technology (NIST) 2006 speaker recognition evaluation (SRE) corpus show that the features derived from the phase spectrum outperform mel-frequency cepstral coefficients (MFCCs) by a large margin: even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
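To make the detection pipeline concrete, the sketch below illustrates one plausible realization under stated assumptions: frame-level phase-derived features (here, the cosine of the unwrapped short-time phase spectrum) scored with a log-likelihood ratio between two GMMs, one trained on natural and one on converted speech. The specific feature choice, frame parameters, and mixture count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumptions, not the paper's exact recipe): phase-spectrum
# features scored by a natural-vs-converted GMM log-likelihood ratio.
import numpy as np
from sklearn.mixture import GaussianMixture

def phase_features(signal, frame_len=512, hop=256):
    """Frame-wise phase-derived features from the short-time Fourier transform."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        phase = np.unwrap(np.angle(spectrum))
        # Cosine of the unwrapped phase gives a bounded phase representation.
        frames.append(np.cos(phase))
    return np.array(frames)

def train_detector(natural_feats, converted_feats, n_mix=64):
    """Fit one GMM per class on pooled frame-level features."""
    gmm_nat = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(natural_feats)
    gmm_conv = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(converted_feats)
    return gmm_nat, gmm_conv

def detection_score(utterance, gmm_nat, gmm_conv):
    """Average log-likelihood ratio; positive values favour natural speech."""
    feats = phase_features(utterance)
    return gmm_nat.score_samples(feats).mean() - gmm_conv.score_samples(feats).mean()
```

In the third training condition described above, where no converted speech is available, the converted-speech GMM would have to be replaced or omitted (e.g., thresholding the natural-speech likelihood alone); the two-model ratio shown here covers only the first two conditions.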
