Robust Detection of Machine-induced Audio Attacks in Intelligent Audio Systems with Microphone Array

With the popularity of intelligent audio systems in recent years, their vulnerabilities have become an increasing public concern. Existing studies have designed a set of machine-induced audio attacks, such as replay attacks, synthesis attacks, hidden voice commands, inaudible attacks, and audio adversarial examples, which could expose users to serious security and privacy threats. Existing defenses treat these attacks individually. While they have yielded reasonably good performance in certain cases, they can hardly be combined into an all-in-one solution that can be deployed on audio systems in practice. Additionally, modern intelligent audio devices, such as Amazon Echo and Apple HomePod, usually come equipped with microphone arrays for far-field voice recognition and noise reduction. Existing defense strategies have focused on single- and dual-channel audio, and only a few studies have explored using multi-channel microphone arrays to defend against specific types of audio attacks. Motivated by the lack of systematic research on defending against miscellaneous audio attacks and by the potential benefits of multi-channel audio, this paper builds a holistic solution for detecting machine-induced audio attacks that leverages the multi-channel microphone arrays on modern intelligent audio systems. Specifically, we utilize magnitude and phase spectrograms of multi-channel audio to extract spatial information and leverage a deep learning model to detect the fundamental difference between human speech and adversarial audio generated by playback machines. Moreover, we adopt an unsupervised domain adaptation training framework to further improve the model's generalizability to new acoustic environments. Evaluation is conducted under various settings on a public multi-channel replay attack dataset and a self-collected multi-channel audio attack dataset involving 5 types of advanced audio attacks. The results show that our method can achieve an equal error rate (EER) as low as 6.6% in detecting a variety of machine-induced attacks. Even in new acoustic environments, our method can still achieve an EER as low as 8.8%.
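
As a rough illustration of two ideas mentioned above, the following Python sketch computes per-channel magnitude and phase spectrograms from a multi-channel recording and estimates an equal error rate from detector scores. It is not the paper's implementation: the function names, the 512-sample frame, the 256-sample hop, the Hann window, and the convention that a higher score means "more likely genuine human speech" are all assumptions made for the example.

```python
import numpy as np


def multichannel_spectrograms(audio, n_fft=512, hop=256):
    """Return (magnitude, phase) spectrograms for multi-channel audio.

    audio: array of shape (channels, samples).
    Both outputs have shape (channels, frames, n_fft // 2 + 1).
    Frame length, hop size, and window are illustrative assumptions.
    """
    audio = np.asarray(audio, dtype=np.float64)
    n_channels, n_samples = audio.shape
    if n_samples < n_fft:
        raise ValueError("audio is shorter than one analysis frame")
    n_frames = 1 + (n_samples - n_fft) // hop
    window = np.hanning(n_fft)
    # Sample indices of every frame, shape (frames, n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[:, idx] * window          # (channels, frames, n_fft)
    spec = np.fft.rfft(frames, axis=-1)      # complex short-time spectrum
    return np.abs(spec), np.angle(spec)


def equal_error_rate(genuine_scores, attack_scores):
    """Estimate the EER, assuming higher scores indicate genuine speech.

    Sweeps every observed score as a threshold and returns the error
    rate where false acceptance and false rejection are closest.
    """
    genuine = np.asarray(genuine_scores, dtype=np.float64)
    attack = np.asarray(attack_scores, dtype=np.float64)
    thresholds = np.sort(np.concatenate([genuine, attack]))
    # False rejection: genuine speech scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance: attack audio scored at or above the threshold.
    far = np.array([(attack >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

In this sketch the phase spectrograms are kept per channel, so differences between channels (which carry the spatial cues a microphone array provides) remain available to whatever model consumes the features. An EER of 6.6% would correspond to an operating threshold at which roughly 6.6% of attacks are accepted and 6.6% of genuine utterances are rejected.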
