Black-box Adversarial Attacks on Commercial Speech Platforms with Minimal Information

Adversarial attacks against commercial black-box speech platforms, including cloud speech APIs and voice control devices, have received little attention until recent years. Constructing such attacks is difficult mainly due to the unique characteristics of time-domain speech signals and the much more complex architecture of acoustic systems. Current "black-box" attacks all rely heavily on knowledge of prediction/confidence scores or other probability information to craft effective adversarial examples (AEs), and service providers can intuitively defend against them simply by not returning these messages. In this paper, we take one step further and propose two novel adversarial attacks under more practical and rigorous scenarios. For commercial cloud speech APIs, we propose Occam, a decision-only black-box adversarial attack in which only final decisions are available to the adversary. In Occam, we formulate decision-only AE generation as a discontinuous, large-scale global optimization problem, and solve it by adaptively decomposing this complicated problem into a set of sub-problems and cooperatively optimizing each one. Occam is a one-size-fits-all approach that achieves a 100% success rate of attacks (SRoA) with an average SNR of 14.23 dB on a wide range of popular speech and speaker recognition APIs, including Google, Alibaba, Microsoft, Tencent, iFlytek, and Jingdong, outperforming state-of-the-art black-box attacks. For commercial voice control devices, we propose NI-Occam, the first non-interactive physical adversarial attack, in which the adversary neither queries the oracle nor has access to its internal information and training data. For the first time, we combine adversarial attacks with model inversion attacks, and thus generate physically effective audio AEs with high transferability, without any interaction with the target devices. Our experimental results show that NI-Occam can successfully fool Apple Siri, Microsoft Cortana, Google Assistant, iFlytek, and Amazon Echo with an average SRoA of 52% and an SNR of 9.65 dB, shedding light on non-interactive physical attacks against voice control devices.
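
To make the decomposition idea concrete, the following Python sketch illustrates decision-only perturbation refinement via cooperative sub-problem optimization: the perturbation is split into blocks, and each block is mutated and accepted only if the oracle's hard-label decision stays adversarial while the perturbation norm shrinks. This is an illustrative sketch under assumptions, not the paper's Occam implementation; `decision_oracle`, the block count, the contraction factor, and the (1+1)-style mutation are hypothetical stand-ins for the real API query and the adaptive decomposition strategy described above.

```python
"""Minimal sketch of a decision-only (hard-label) audio attack via
cooperative decomposition. All parameters are illustrative assumptions."""
import numpy as np


def decision_oracle(audio: np.ndarray) -> bool:
    # Placeholder: a real attack would submit `audio` to the target API and
    # return True iff the API outputs the adversary's target decision.
    return np.mean(np.abs(audio)) > 0.05  # toy criterion for illustration


def coevolve_attack(x0, delta0, n_blocks=8, iters=200, sigma=0.01, seed=0):
    """Shrink an initially successful perturbation `delta0` block by block,
    keeping the oracle decision adversarial while reducing its L2 norm."""
    rng = np.random.default_rng(seed)
    delta = delta0.copy()
    blocks = np.array_split(np.arange(len(delta)), n_blocks)  # sub-problems
    for _ in range(iters):
        for idx in blocks:
            candidate = delta.copy()
            # (1+1)-ES style mutation that also contracts toward zero,
            # so accepted candidates tend to be less audible.
            candidate[idx] = 0.9 * candidate[idx] + sigma * rng.standard_normal(len(idx))
            if (decision_oracle(x0 + candidate)                 # still adversarial
                    and np.linalg.norm(candidate) < np.linalg.norm(delta)):
                delta = candidate                               # accept improvement
    return delta


if __name__ == "__main__":
    x0 = np.zeros(16000)            # 1 s of "benign" audio at 16 kHz (dummy input)
    delta0 = 0.2 * np.ones(16000)   # large starting perturbation that already succeeds
    delta = coevolve_attack(x0, delta0)
    print("final perturbation norm:", np.linalg.norm(delta))
```

The key property the sketch shares with the decision-only setting is that the optimizer consumes nothing but a binary oracle answer per query; no scores, probabilities, or gradients are assumed.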
