Attacking Video Recognition Models with Bullet-Screen Comments

Recent research has demonstrated that Deep Neural Networks (DNNs) are vulnerable to adversarial patches, which introduce perceptible but localized changes to the input. However, existing approaches have focused on generating adversarial patches for images; their counterparts for videos remain less explored. Compared with images, attacking videos is much more challenging, as it requires considering not only spatial cues but also temporal cues. To close this gap, we introduce a novel adversarial attack in this paper, the bullet-screen comment (BSC) attack, which attacks video recognition models with BSCs. Specifically, adversarial BSCs are generated with a Reinforcement Learning (RL) framework, where the environment is set as the target model and the agent plays the role of selecting the position and transparency of each BSC. By continuously querying the target model and receiving feedback, the agent gradually adjusts its selection strategy to achieve a high fooling rate with non-overlapping BSCs. Since BSCs can be regarded as a kind of meaningful patch, adding them to a clean video will neither affect people's understanding of the video content nor arouse suspicion. We conduct extensive experiments to verify the effectiveness of the proposed method. On both the UCF-101 and HMDB-51 datasets, our BSC attack achieves a fooling rate of about 90% when attacking three mainstream video recognition models, while occluding less than 8% of the area in the video.
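To make the query-feedback loop concrete, the sketch below illustrates one way such an attack could be organized: an agent proposes a position and transparency for each BSC, the patched clip is sent to the black-box target model, and the reward (the drop in the true-class score) drives the policy update. This is a minimal illustrative sketch, not the authors' implementation; `query_model`, `policy`, and the pre-rendered BSC `patches` are hypothetical placeholders.

```python
import numpy as np

def overlay_bsc(video, patch, x, y, alpha):
    """Alpha-blend one bullet-screen comment patch onto every frame.

    video: (T, H, W, C) float array in [0, 1]; patch: (h, w, C) rendered text.
    Assumes the agent keeps the patch inside the frame boundaries.
    """
    out = video.copy()
    h, w = patch.shape[:2]
    out[:, y:y + h, x:x + w] = alpha * patch + (1.0 - alpha) * out[:, y:y + h, x:x + w]
    return out

def bsc_attack(video, true_label, patches, query_model, policy, max_queries=3000):
    """Query-feedback loop: the agent proposes (x, y, alpha) for each BSC,
    the black-box target model is queried, and the policy is updated from the reward."""
    for _ in range(max_queries):
        adv = video
        for patch in patches:
            x, y, alpha = policy.sample(patch.shape, video.shape)  # agent's action
            adv = overlay_bsc(adv, patch, x, y, alpha)
        probs = query_model(adv)                 # black-box query of the target model
        reward = -float(probs[true_label])       # push down the true-class score
        policy.update(reward)
        if int(np.argmax(probs)) != true_label:  # untargeted attack succeeds
            return adv
    return None
```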
