Deep Feature Engineering for Noise Robust Spoofing Detection

Spoofing detection for automatic speaker verification (ASV) aims to discriminate between genuine and spoofed speech. The topic has received increased attention recently because of security concerns about deploying ASV systems. Although recent studies have significantly improved spoofing detection in clean conditions, performance degrades dramatically in noisy conditions. To address this issue, we propose using deep learning techniques to extract robust and discriminative deep features for spoofing detection. In particular, we employ deep feedforward, recurrent, and convolutional neural networks as feature extractors. We also introduce multicondition training, noise-aware training, and annealed dropout training to make the neural networks more robust against noise and to avoid overfitting to specific spoofing attacks and noise types. The proposed neural networks and training techniques are combined into a single framework for spoofing detection. Experimental evaluation is carried out on a noisy version of the standard ASVspoof 2015 corpus that includes both additive-noise and reverberant scenarios. Experimental results show that the proposed system reduces the average equal error rate from 19.1% to 3.2% in seen noisy conditions and from 22.6% to 5.1% in unseen noisy conditions.
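The equal error rate (EER) quoted above is the operating point at which the false acceptance rate (spoofed speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of how an EER could be computed from detection scores. It assumes that higher scores mean "more likely genuine"; the function name `equal_error_rate` and the toy scores are illustrative and are not taken from the paper, whose scoring tools may differ.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted as genuine when its score is at or
    above the threshold. Returns (eer, threshold).
    """
    genuine = list(genuine_scores)
    spoof = list(spoof_scores)
    if not genuine or not spoof:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    best_eer = None
    best_threshold = None
    for threshold in sorted(set(genuine + spoof)):
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= threshold for s in spoof) / len(spoof)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return best_eer, best_threshold


# Toy scores, invented purely for illustration.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finite score sets the two error rates rarely cross exactly, so this sketch averages them at the closest threshold; evaluation toolkits typically interpolate instead, which can give slightly different values.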
