Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection

Speech Synthesis (SS) and Voice Conversion (VC) present a genuine risk of attacks on Automatic Speaker Verification (ASV) technology. In this paper, we use our recently proposed unsupervised filterbank learning technique, based on the Convolutional Restricted Boltzmann Machine (ConvRBM), as a front-end feature representation. The ConvRBM is trained on the training subset of the ASVspoof 2015 challenge database. Analysis of the filterbank trained on this dataset shows that the ConvRBM learned more low-frequency subband filters than when trained on a natural speech database such as TIMIT. The spoofing detection experiments were performed using Gaussian Mixture Models (GMM) as a back-end classifier. ConvRBM-based cepstral coefficients (ConvRBM-CC) perform better than handcrafted Mel Frequency Cepstral Coefficients (MFCC). On the evaluation set, ConvRBM-CC features give an absolute reduction of 4.76 % in Equal Error Rate (EER) compared to MFCC features. Specifically, ConvRBM-CC features perform significantly better than MFCC features on both known attacks (1.93 %) and unknown attacks (5.87 %).
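
The results above are reported as Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed speech accepted as natural) equals the false rejection rate (natural speech rejected). As a minimal sketch of how such a figure can be computed from detector scores, the following assumes that a higher score means "more likely natural", for example a log-likelihood ratio between a natural-speech GMM and a spoofed-speech GMM. The function name `equal_error_rate` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def equal_error_rate(genuine_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'more likely natural'.

    Illustrative sketch only: the threshold is swept over the observed
    scores and the point where FAR and FRR are closest is reported.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score seen in either class.
    thresholds = np.unique(np.concatenate([genuine, spoof]))

    # False rejection rate: natural trials scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: spoofed trials scoring at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (hypothetical, not from the paper).
natural = [2.0, 3.0, 4.0, 5.0]
spoofed = [0.0, 1.0, 2.5, 1.5]
eer, threshold = equal_error_rate(natural, spoofed)
print(f"EER = {100 * eer:.2f} %, threshold = {threshold}")
```

An absolute reduction in EER, as quoted in the abstract, is the plain difference between two such percentages (baseline EER minus proposed EER), not a ratio.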
