DNN Filter Bank Cepstral Coefficients for Spoofing Detection

With the development of speech synthesis techniques, automatic speaker verification systems face the serious challenge of spoofing attacks. To improve the reliability of speaker verification systems, we develop a new filter bank-based cepstral feature, deep neural network (DNN) filter bank cepstral coefficients, to distinguish between natural and spoofed speech. The DNN filter bank is generated automatically by training a filter bank neural network (FBNN) on natural and synthetic speech. By adding restrictions to the training rules, the learned weight matrix of the FBNN is band-limited and sorted by frequency, similar to a conventional filter bank. Unlike a manually designed filter bank, the learned filter bank has a different filter shape in each channel, which captures the differences between natural and synthetic speech more effectively. Experimental results on the ASVspoof 2015 database show that a Gaussian mixture model maximum-likelihood classifier trained on the new feature performs better than the state-of-the-art classifier based on linear frequency triangle filter bank cepstral coefficients, especially in detecting unknown attacks.
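The feature pipeline described above (a band-limited, frequency-sorted filter bank followed by a cepstral transform) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linearly spaced band edges, the sigmoid used to keep weights non-negative, the function names, and the dimensions (257 frequency bins, 20 filters, 12 coefficients) are all assumptions made for illustration, and the FBNN training that would learn `raw_weights` is not shown.

```python
import numpy as np

def band_limit_mask(n_bins, n_filters):
    """0/1 mask: filter k may be non-zero only between band edges k and k+2.

    Linearly spaced edges are an assumption; they keep the filters
    band-limited and ordered by frequency.
    """
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    mask = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        mask[k] = (bins >= edges[k]) & (bins <= edges[k + 2])
    return mask

def constrained_filter_bank(raw_weights, mask):
    """Map unconstrained weights to a non-negative, band-limited filter bank.

    The sigmoid is an assumed way to enforce non-negativity; the mask
    zeroes every weight outside each filter's allowed band.
    """
    sigmoid = 1.0 / (1.0 + np.exp(-raw_weights))
    return sigmoid * mask

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return x @ basis.T

def filter_bank_cepstrum(power_spectrum, filter_bank, n_coeffs=12):
    """Filter bank energies -> log -> DCT, one feature vector per frame."""
    energies = power_spectrum @ filter_bank.T
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    return dct_ii(log_energies, n_coeffs)

# Hypothetical usage with random stand-ins for learned weights and spectra.
rng = np.random.default_rng(0)
mask = band_limit_mask(n_bins=257, n_filters=20)
raw_weights = rng.normal(size=(20, 257))   # would come from FBNN training
filter_bank = constrained_filter_bank(raw_weights, mask)
frames = rng.random((5, 257))              # stand-in power spectra, 5 frames
features = filter_bank_cepstrum(frames, filter_bank)
print(features.shape)
```

Because the mask is fixed and only the weights inside each band vary, every channel can take its own shape while the bank as a whole stays ordered by frequency, which is the property the abstract attributes to the learned filter bank.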
