Voice conversion spoofing detection by exploring artifacts estimates

Automatic speaker verification, or voice biometrics, verifies a person's claimed identity from their voice, with applications in mobile banking and forensics. As speaker verification systems see wider deployment, studying spoofing threats and building effective countermeasures has gained attention. Spoofing is a genuine challenge because it raises the false acceptance rate, i.e., an impostor is incorrectly accepted as a genuine speaker, so detecting spoofing attacks is essential for making voice biometrics viable in practice. In voice conversion spoofing, the impostor's speech is transformed into the target speaker's speech using signal processing techniques. Since studies show that voice conversion introduces artifacts into the resulting speech, this paper proposes a novel approach to detect voice conversion spoofing by extracting artifact estimates from the input speech signal. A non-negative matrix factorization (NMF) based source separation technique is employed to obtain the artifact estimates, and a Convolutional Neural Network binary classifier is then trained on these estimates to distinguish natural from synthetic speech. Experiments are conducted on the Voice Conversion Challenge 2016 and 2018 databases. Results show that the proposed technique performs well in detecting a wide range of unknown attacks. Compared with state-of-the-art spoof detection systems based on Constant Q Cepstral Coefficients and Linear Frequency Cepstral Coefficients, the proposed system gives equivalent or better performance. Robustness to various noise types is validated on the NOIZEUS database, demonstrating the efficiency of the proposed system in noisy environments.
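The first stage of the pipeline described above, NMF-based source separation of a magnitude spectrogram, can be sketched as follows. This is a minimal illustration using standard multiplicative-update NMF; the factorization rank, the update rules, and the choice of the reconstruction residual as the "artifact estimate" are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Factor a non-negative matrix V (freq x time) as V ~= W @ H using
    multiplicative updates that minimize squared Euclidean distance."""
    rng = np.random.default_rng(0)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, n_frames)) + eps  # time-varying activations
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy magnitude "spectrogram" standing in for an utterance.
V = np.abs(np.random.default_rng(1).standard_normal((64, 100)))
W, H = nmf(V, rank=8)

# Illustrative assumption: components dominated by clean-speech structure are
# captured by W @ H, and the residual serves as the artifact estimate fed to
# the downstream CNN classifier.
artifact_estimate = V - W @ H
print(artifact_estimate.shape)
```

In practice the residual (or the separated component attributed to conversion artifacts) would be computed per utterance and passed, e.g., as a time-frequency image, to the binary classifier.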
