With the proliferation of natural language interfaces on mobile devices and in home personal assistants such as Siri and Alexa, many services and data are now accessed through transcriptions produced by a speech recognition system. One major risk in this trend is that a malicious adversary may attack the system without the primary user noticing. One way to accomplish this is to use adversarial examples that are perceived one way by a human but transcribed differently by the Automatic Speech Recognition (ASR) system. For example, a recording might sound like "hello" to the human ear but be transcribed as "goodbye" by the ASR system. Recent work has shown that adversarial examples can be created for convolutional neural networks to fool vision recognition systems. We show that similar methods can be applied to neural ASR systems. We present successful results for two methods of generating adversarial examples that fool a high-quality ASR system while the difference in the audio remains imperceptible to the human ear. We also present a method for converting the adversarial MFCC features back into audio.
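To make the attack idea concrete, the sketch below shows a fast-gradient-sign-style perturbation of ASR input features, in the spirit of the methods the abstract refers to. It is a minimal illustration under assumed interfaces, not the authors' implementation: the `model`, its output shape, the use of CTC loss, and the helper `fgsm_perturb` are all hypothetical placeholders.

```python
# Minimal FGSM-style sketch: nudge MFCC features toward a chosen target
# transcript while keeping the perturbation small (L-infinity bounded).
# Assumptions: a differentiable ASR model that maps (batch, time, n_mfcc)
# features to per-frame log-probabilities, trained with a CTC objective.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, features, target_ids, input_lengths, target_lengths,
                 epsilon=0.01):
    """Perturb MFCC features so the model favors a chosen target transcript."""
    features = features.clone().detach().requires_grad_(True)

    # Assumed model interface: (batch, time, n_mfcc) -> (batch, time, vocab)
    log_probs = model(features)

    # CTC loss toward the *target* transcript; F.ctc_loss expects
    # log-probabilities shaped (time, batch, vocab).
    loss = F.ctc_loss(log_probs.transpose(0, 1), target_ids,
                      input_lengths, target_lengths)

    # Gradient of the loss with respect to the input features.
    loss.backward()

    # Step against the gradient to make the target transcript more likely,
    # with a small epsilon so the change to the audio stays subtle.
    adversarial = features - epsilon * features.grad.sign()
    return adversarial.detach()

# The adversarial MFCCs would then be mapped back to a waveform. One common
# (but not necessarily the authors') approach is Griffin-Lim-based inversion,
# e.g. librosa.feature.inverse.mfcc_to_audio on (n_mfcc, n_frames) arrays:
#   import librosa
#   audio = librosa.feature.inverse.mfcc_to_audio(adversarial[0].numpy().T)
```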