Synthetic Speech Discrimination using Pitch Pattern Statistics Derived from Image Analysis

In this paper, we extend the work by Ogihara et al. [2] to discriminate between human and synthetic speech using features based on pitch patterns. As previously demonstrated, significant differences in pitch patterns between human and synthetic speech can be leveraged to classify speech as human or synthetic in origin. We propose mean pitch stability, mean pitch stability range, and jitter as features extracted via image analysis of pitch patterns. We observe that for synthetic speech these features occupy a small, distinct region of the feature space compared to human speech, and we model them with a multivariate Gaussian distribution. Our classifier is trained on synthetic speech collected from the 2008 and 2011 Blizzard Challenges, Festival pre-built voices, and human speech from the NIST 2002 corpus. We evaluate the classifier on a much larger corpus than previously studied, using human speech from the Switchboard corpus, synthetic speech from the Resource Management corpus, and synthetic speech generated by Festival trained on the Wall Street Journal corpus. Results show 98% accuracy in correctly classifying human speech and 96% accuracy in correctly classifying synthetic speech.

Index Terms: Speaker recognition, Speech synthesis, Security
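The classification scheme described above — fitting a multivariate Gaussian to the three pitch-pattern features of synthetic speech and thresholding the log-likelihood — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature values, the decision threshold, and the function names are all hypothetical placeholders, and the actual feature extraction from pitch-pattern images is not shown.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a multivariate Gaussian to a feature matrix X of shape
    (n_samples, 3), where the columns are the (hypothetical) features:
    mean pitch stability, mean pitch stability range, and jitter."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # 3x3 sample covariance
    return mu, cov

def log_likelihood(x, mu, cov):
    """Log-density of feature vector x under the fitted Gaussian."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.inv(cov) @ diff  # Mahalanobis distance squared
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def classify(x, mu, cov, threshold):
    """Label x as 'synthetic' if its likelihood under the synthetic-speech
    model exceeds the threshold, else 'human'. The threshold is an
    assumption here; in practice it would be tuned on held-out data."""
    return "synthetic" if log_likelihood(x, mu, cov) >= threshold else "human"
```

Because the synthetic-speech features cluster tightly, a single Gaussian fitted only to synthetic examples suffices: human speech falls far from the mode and receives a very low likelihood, so a one-sided threshold separates the two classes.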

[1] Heiga Zen et al., "Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," IEEE Transactions on Audio, Speech, and Language Processing, 2009.

[2] Akio Ogihara et al., "Discrimination Method of Synthetic Speech Using Pitch Frequency against Synthetic Speech Falsification," IEICE Trans. Fundam. Electron. Commun. Comput. Sci., 2005.

[3] Sen M. Kuo et al., "Real-Time Digital Signal Processing," 2001.

[4] Junichi Yamagishi et al., "Revisiting the Security of Speaker Verification Systems against Imposture Using Synthetic Speech," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2010.

[5] Ibon Saratxaga et al., "Detection of Synthetic Speech for the Problem of Imposture," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[6] Simon King et al., "The Blizzard Challenge 2008," 2008.

[7] Simon King et al., "Thousands of Voices for HMM-Based Speech Synthesis: Analysis and Application of TTS Systems Built on Various ASR Corpora," IEEE Transactions on Audio, Speech, and Language Processing, 2009.

[8] G. Bachur et al., "Separation of Voiced and Unvoiced Using Zero Crossing Rate and Energy of the Speech Signal," 2008.

[9] Junichi Yamagishi et al., "Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech," Odyssey, 2010.

[10] Keiichi Tokuda et al., "Imposture against a Speaker Verification System Using Synthetic Speech," 2000.

[11] Simon King et al., "The Blizzard Challenge 2011," 2011.

[12] Paul Taylor et al., "Festival Speech Synthesis System," 1998.

[13] Ibon Saratxaga et al., "Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech," IEEE Transactions on Audio, Speech, and Language Processing, 2012.