When the Differences in Frequency Domain are Compensated: Understanding and Defeating Modulated Replay Attacks on Automatic Speech Recognition

Automatic speech recognition (ASR) systems have been widely deployed in modern smart devices to provide convenient and diverse voice-controlled services. Since ASR systems are vulnerable to audio replay attacks that can spoof and mislead them, a number of defense systems have been proposed to identify replayed audio signals based on the speakers' unique acoustic features in the frequency domain. In this paper, we uncover a new type of replay attack, called the modulated replay attack, which can bypass the existing frequency-domain-based defense systems. The basic idea is to compensate for the frequency distortion of a given electronic speaker using an inverse filter that is customized to the speaker's transform characteristics. Our experiments on real smart devices confirm that modulated replay attacks can successfully escape the existing detection mechanisms that rely on identifying suspicious features in the frequency domain. To defeat modulated replay attacks, we design and implement a countermeasure named DualGuard. We discover and formally prove that no matter how the replay audio signals are modulated, the replay attacks will either leave ringing artifacts in the time domain or cause spectrum distortion in the frequency domain. Therefore, by jointly checking suspicious features in both the frequency and time domains, DualGuard can successfully detect various replay attacks, including modulated replay attacks. We implement a prototype of DualGuard on a popular voice interactive platform, ReSpeaker Core v2. The experimental results show that DualGuard can achieve 98% accuracy in detecting modulated replay attacks.
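The inverse-filter idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the speaker magnitude response `speaker_response`, its corner frequency, the test tones, and the helper names are all hypothetical, and phase effects, microphone response, and the time-domain ringing artifacts are ignored. The sketch only shows why pre-multiplying the spectrum by the inverse of the speaker's response cancels the frequency distortion that a frequency-domain detector would look for.

```python
import numpy as np

def speaker_response(freqs):
    """Hypothetical loudspeaker magnitude response (an assumption, not measured).

    It attenuates low frequencies, loosely mimicking a small speaker.
    The +0.05 floor keeps the response strictly positive so it can be inverted.
    """
    f0 = 500.0  # assumed corner frequency in Hz
    ratio = freqs / f0
    return ratio / np.sqrt(1.0 + ratio ** 2) + 0.05

def apply_gain(signal, gain):
    """Scale each frequency bin of a real signal by a real-valued gain."""
    spectrum = np.fft.rfft(signal)
    return np.fft.irfft(spectrum * gain, n=len(signal))

def spectral_error(candidate, reference):
    """Relative difference between two magnitude spectra."""
    cand_mag = np.abs(np.fft.rfft(candidate))
    ref_mag = np.abs(np.fft.rfft(reference))
    return np.linalg.norm(cand_mag - ref_mag) / np.linalg.norm(ref_mag)

fs = 16000                                  # sampling rate in Hz
t = np.arange(fs) / fs                      # one second of audio
# Stand-in for genuine speech: two tones, one low and one mid frequency.
genuine = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

freqs = np.fft.rfftfreq(len(genuine), d=1.0 / fs)
H = speaker_response(freqs)                 # distortion added by playback
inverse_filter = 1.0 / H                    # compensation applied beforehand

# Classic replay: the speaker distorts the spectrum.
plain_replay = apply_gain(genuine, H)
# Modulated replay: pre-compensate, then play through the same speaker.
modulated_replay = apply_gain(apply_gain(genuine, inverse_filter), H)

plain_err = spectral_error(plain_replay, genuine)
modulated_err = spectral_error(modulated_replay, genuine)
print(f"plain replay spectral error:     {plain_err:.3f}")
print(f"modulated replay spectral error: {modulated_err:.2e}")
```

In this idealized setting the modulated replay's magnitude spectrum matches the genuine signal almost exactly, while the plain replay shows a clear low-frequency loss. A real attacker would have to estimate the speaker's response by measurement, and the abstract's claim is that such compensation cannot also remove the time-domain ringing artifacts.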
