Human vs machine spoofing detection on wideband and narrowband data

How well do humans detect spoofing attacks directed at automatic speaker verification systems? This paper investigates the performance of humans at detecting spoofing attacks from speech synthesis and voice conversion systems. Two speaker verification tasks, in which the speakers were either humans or machines, were also conducted. The three tasks were carried out with two types of data: wideband (16kHz) and narrowband (8kHz) telephone line simulated data. Spoofing detection by humans was compared to automatic spoofing detection (ASD) algorithms. Listening tests were carefully constructed to ensure the human and automatic tasks were as similar as possible taking into consideration listener’s constraints (e.g., fatigue and memory limitations). Results for human trials show the error rates on narrowband data double compared to on wideband data. The second verification task, which included only artificial speech, showed equal overall acceptance rates for both 8kHz and 16kHz. In the spoofing detection task, there was a drop in performance on most of the artificial trials as well as on human trials. At 8kHz, 20% of human trials were incorrectly classified as artificial, compared to 12% at 16kHz. The ASD algorithms also showed a drop in performance on 8kHz data, but outperformed human listeners across the board.

[1]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Bin Ma,et al.  Approaching human listener accuracy with modern speaker verification , 2010, INTERSPEECH.

[3]  Haizhou Li,et al.  A study on replay attack and anti-spoofing for text-dependent speaker verification , 2014, Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific.

[4]  Chng Eng Siong,et al.  Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Haizhou Li,et al.  Detecting Converted Speech and Natural Speech for anti-Spoofing Attack in Speaker Recognition , 2012, INTERSPEECH.

[6]  Hagai Aronowitz,et al.  Voice transformation-based spoofing of text-dependent speaker verification systems , 2013, INTERSPEECH.

[7]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[8]  Simon King,et al.  The voice bank corpus: Design, collection and data analysis of a large regional accent speech database , 2013, 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE).

[9]  Tomoki Toda,et al.  SAS: A speaker verification spoofing database containing diverse attacks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Daniel Elenius,et al.  A comparison between human perception and a speaker verification system score of a voice imitation. , 2004 .

[11]  Keikichi Hirose,et al.  Effects of Speaker Adaptive Training on Tensor-based Arbitrary Speaker Conversion , 2012, INTERSPEECH.

[12]  Driss Matrouf,et al.  Artificial impostor voice transformation effects on false acceptance rates , 2007, INTERSPEECH.

[13]  Junichi Yamagishi,et al.  Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech , 2010, Odyssey.

[14]  Stanley J. Wenndt,et al.  Machine recognition vs human recognition of voices , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Tomi Kinnunen,et al.  I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry , 2013, INTERSPEECH.

[16]  Ibon Saratxaga,et al.  Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Haizhou Li,et al.  A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[18]  M. Wagner,et al.  Vulnerability of speaker verification to voice mimicking , 2004, Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004..

[19]  Tomoki Toda,et al.  Non-parallel training for many-to-many eigenvoice conversion , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  M. Cooke,et al.  Consonant identification in noise by native and non-native listeners: effects of local context. , 2008, The Journal of the Acoustical Society of America.

[21]  R. Smits,et al.  Patterns of English phoneme confusions by native and non-native listeners. , 2004, The Journal of the Acoustical Society of America.

[22]  Tomi Kinnunen,et al.  Comparison of human listeners and speaker verification systems using voice mimicry data , 2014, Odyssey.

[23]  Thomas H. Crystal,et al.  Speaker Verification by Human Listeners: Experiments Comparing Human and Machine Performance Using the NIST 1998 Speaker Evaluation Data , 2000, Digit. Signal Process..

[24]  Simon King,et al.  Spoofing and Anti-Spoofing (SAS) corpus v1.0 , 2015 .

[25]  Zhizheng Wu,et al.  Human vs Machine Spoofing , 2015 .

[26]  Haizhou Li,et al.  Exemplar-based unit selection for voice conversion utilizing temporal information , 2013, INTERSPEECH.

[27]  Hans-Günter Hirsch,et al.  The simulation of realistic acoustic input scenarios for speech recognition systems , 2005, INTERSPEECH.

[28]  Haizhou Li,et al.  Spoofing and countermeasures for speaker verification: A survey , 2015, Speech Commun..

[29]  D. Pisoni,et al.  Recognition of spoken words by native and non-native listeners: talker-, listener-, and item-related factors. , 1999, The Journal of the Acoustical Society of America.

[30]  Moncef Gabbouj,et al.  Voice Conversion Using Dynamic Kernel Partial Least Squares Regression , 2012, IEEE Transactions on Audio, Speech, and Language Processing.