Finding Achilles' Heel: Adversarial Attack on Multi-modal Action Recognition

Neural network-based models are notorious for their adversarial vulnerability. Recent adversarial machine learning has mainly focused on images, where a small perturbation added to the input suffices to fool the model. Very recently, this practice has been extended to attacks on human action videos by adding perturbations to key frames. Unfortunately, frame selection is usually computationally expensive at run-time, and adding noise to all frames is unrealistic as well. In this paper, we present a novel yet efficient approach to address this issue. Multi-modal video data such as RGB, depth, and skeleton data have been widely used for human action modeling, and they have demonstrated superior performance over any single modality. Interestingly, we observe that the skeleton data is the most "vulnerable" modality under adversarial attack, and we propose to exploit this "Achilles' heel" to attack multi-modal video data. In particular, first, an adversarial learning paradigm is designed to perturb skeleton data for a specific action under a black-box setting, which highlights which body joints and key video segments are subject to attack. Second, we propose a graph attention model to explore the semantics between segments, both across modalities and within a modality. Third, the attack is launched at run-time on all modalities through the learned semantics. The proposed method has been extensively evaluated on multi-modal visual action datasets, including PKU-MMD and NTU RGB+D, to validate its effectiveness.
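The abstract does not spell out the black-box optimizer used in the first step. As a rough illustration only, the sketch below uses NES-style zeroth-order gradient estimation, a standard query-only technique that is not necessarily the authors' method, to perturb skeleton joint coordinates until the predicted action flips. Everything here is an assumption for the sake of the example: `model_scores` is a hypothetical stand-in for the target classifier's query interface, and the clip shape, 25-joint layout, and step sizes are illustrative.

```python
# Minimal sketch of a query-only (black-box) attack on skeleton data.
# NES-style gradient estimation stands in for the paper's unspecified
# black-box paradigm; model_scores is a hypothetical query interface.
import numpy as np

def model_scores(skeleton):
    """Hypothetical black-box classifier: returns class probabilities
    for a skeleton clip of shape (T, J, 3). Stand-in for a real model."""
    rng = np.random.default_rng(abs(hash(skeleton.tobytes())) % (2**32))
    logits = rng.normal(size=10)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def nes_gradient(x, label, sigma=1e-3, n_samples=50):
    """Estimate the gradient of the true-class score w.r.t. the input
    via antithetic Gaussian sampling (score queries only, no gradients)."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.normal(size=x.shape)
        grad += (model_scores(x + sigma * u)[label]
                 - model_scores(x - sigma * u)[label]) * u
    return grad / (2 * sigma * n_samples)

def attack(x, label, eps=0.01, step=0.002, iters=30):
    """Iteratively push joint coordinates to decrease the true-class
    score, keeping the perturbation inside an L_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(iters):
        g = nes_gradient(x_adv, label)
        x_adv -= step * np.sign(g)                 # descend on true-class score
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back to eps-ball
        if model_scores(x_adv).argmax() != label:  # early exit on success
            break
    return x_adv

clip = np.random.rand(32, 25, 3)   # 32 frames, 25 joints, (x, y, z)
adv = attack(clip, label=3)
print("max joint displacement:", np.abs(adv - clip).max())
```

Per-joint perturbation magnitudes from such an attack (e.g., `np.abs(adv - clip).sum(axis=(0, 2))`) give one way to read off which body joints and temporal segments are most attackable, in the spirit of the first contribution; the paper's second and third steps then propagate this skeleton-level attack to the other modalities via learned cross-modal attention.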
