Improving the potential of Enhanced Teager Energy Cepstral Coefficients (ETECC) for replay attack detection

Abstract In voice biometrics, a replay attack (RA) is an impostor's dishonest attempt to spoof someone else's identity by replaying the subject's previously recorded speech close to the Automatic Speaker Verification (ASV) system under attack. State-of-the-art strategies for RA detection, such as the Enhanced Teager Energy Cepstral Coefficients (ETECC), have shown promising results because they measure the energy of the high-frequency components of speech precisely, as a function of two recently defined concepts: signal mass and the Enhanced Teager Energy Operator (ETEO). Nevertheless, since the replay mechanism prominently deteriorates the speech spectrum precisely in those spectral zones, we propose associating ETEO with different strategies to further improve on the previous results and obtain effective countermeasures against RAs. Specifically, this paper provides comprehensive evaluations that include a detailed mathematical analysis, a simulation on amplitude- and frequency-modulated (AM-FM) signals, and a spectrographic inspection involving different filterbank structures, along with their experimental results. In addition, ETEO-derived features are contrasted with existing feature sets by using Paraconsistent Feature Engineering (PFE) for feature ranking, expanding our previously published results. Lastly, experiments are performed with the ASVSpoof-2017 version 2.0 dataset, the Realistic Replay Attack Microphone Array Speech Corpus (ReMASC), the BTAS-2016 dataset, the ASVSpoof-2019 challenge dataset, and the ASVSpoof-2015 challenge dataset, with Gaussian Mixture Models (GMMs), Convolutional Neural Networks (CNNs), and Light-CNN architectures as classifiers. The standalone ETECC-GMM system showed the best performance, producing equal error rates (EERs) of 5.55% and 10.75% on the development and evaluation sets, respectively.
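To make the kind of AM-FM simulation mentioned above more concrete, the following minimal sketch applies the classic discrete Teager energy operator, psi[n] = x[n]^2 - x[n-1]x[n+1], to a pure tone and to a synthetic AM-FM signal. It is not the ETEO: the signal-mass formulation is not given in this abstract, so the classic operator is swapped in for illustration only. The function names (`teager_energy`, `am_fm_signal`) and every parameter value are hypothetical choices, not taken from the paper.

```python
import numpy as np


def teager_energy(x):
    """Classic discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    This is the classic operator, NOT the enhanced (signal-mass) variant.
    The output is two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def am_fm_signal(n_samples, fs, fc, fm, am_depth, fm_dev):
    """Synthetic AM-FM signal; all parameters are illustrative assumptions."""
    t = np.arange(n_samples) / fs
    # Slowly varying amplitude envelope.
    amplitude = 1.0 + am_depth * np.cos(2 * np.pi * fm * t)
    # Phase whose derivative is 2*pi*(fc + fm_dev*cos(2*pi*fm*t)).
    phase = 2 * np.pi * fc * t + (fm_dev / fm) * np.sin(2 * np.pi * fm * t)
    return amplitude * np.cos(phase)


fs = 16000  # assumed sampling rate in Hz

# For a pure tone A*cos(Omega*n), the operator returns A^2 * sin^2(Omega).
tone = 0.5 * np.cos(2 * np.pi * 1000 * np.arange(400) / fs)
psi_tone = teager_energy(tone)
expected = 0.5 ** 2 * np.sin(2 * np.pi * 1000 / fs) ** 2

# For an AM-FM signal, the output tracks amplitude and frequency jointly.
x = am_fm_signal(400, fs, fc=2000.0, fm=50.0, am_depth=0.3, fm_dev=100.0)
psi = teager_energy(x)

print(np.allclose(psi_tone, expected))
print(psi.min(), psi.max())
```

Because the operator's output grows with both amplitude and frequency, energy in the high-frequency components is weighted more heavily than under a plain squared-magnitude measure, which is the property the abstract relies on.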
