Generating and Protecting Against Adversarial Attacks for Deep Speech-Based Emotion Recognition Models

Deep learning models for speech emotion recognition have become a popular area of research. Adversarially generated data can cause such models to make false predictions, so defense methods against these attacks are needed to ensure model robustness. In this study, we train deep models to defend against non-targeted white-box adversarial attacks. Adversarial data is first generated from the real data using the fast gradient sign method. Adversarial training is then employed as a defense method for speech emotion recognition: we train deep convolutional models with both real and adversarial data, and compare the performance of two adversarial training procedures, namely vanilla adversarial training and similarity-based adversarial training. In our experiments, through the use of adversarial data augmentation, both adversarial training procedures improve performance when validated on the real data. Additionally, similarity-based adversarial training learns a more robust model when working with adversarial data. Finally, the considered VGG-16 model performs best across all models, for both real and generated data.
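
To make the attack step concrete, the fast gradient sign method perturbs an input x with label y by a step of size epsilon in the direction of the sign of the loss gradient, x_adv = x + epsilon * sign(grad_x L(x, y)). The following is a minimal sketch only, assuming a linear softmax classifier as a stand-in for the deep convolutional models in the study; the function names, the parameters `W` and `b`, and the value of `epsilon` are hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(x, y, W, b):
    # Per-example cross-entropy loss of a linear softmax classifier.
    # x: (n, d) features, y: (n,) integer labels, W: (d, k), b: (k,)
    p = softmax(x @ W + b)
    return -np.log(p[np.arange(len(y)), y])


def input_gradient(x, y, W, b):
    # Gradient of the cross-entropy loss with respect to the input x.
    # For logits = x @ W + b, dL/dlogits = p - onehot(y), so dL/dx = (p - onehot(y)) @ W.T
    p = softmax(x @ W + b)
    onehot = np.eye(W.shape[1])[y]
    return (p - onehot) @ W.T


def fgsm(x, y, W, b, epsilon):
    # Non-targeted white-box FGSM: move each input feature by epsilon
    # in the direction that increases the loss for the true label.
    return x + epsilon * np.sign(input_gradient(x, y, W, b))


# Illustrative usage on random data (assumed shapes, not the paper's features).
rng = np.random.default_rng(0)
W_demo, b_demo = rng.normal(size=(5, 3)), rng.normal(size=3)
x_demo, y_demo = rng.normal(size=(8, 5)), rng.integers(0, 3, size=8)
x_adv_demo = fgsm(x_demo, y_demo, W_demo, b_demo, epsilon=0.1)

# Adversarial data augmentation in its simplest form: train on real and
# adversarial examples together, keeping the original labels.
x_train = np.concatenate([x_demo, x_adv_demo])
y_train = np.concatenate([y_demo, y_demo])
```

With a deep network, the analytic gradient in `input_gradient` would be replaced by the framework's automatic differentiation with respect to the input; the perturbation step itself is unchanged.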
