Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement

The attention mechanism has proven effective on a wide range of visual tasks in recent years. In semantic segmentation, it is used in many methods, with both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as backbones. However, we observe that the attention mechanism is vulnerable to patch-based adversarial attacks. Through an analysis of the effective receptive field, we attribute this vulnerability to the wide receptive field of global attention, which may allow the influence of an adversarial patch to spread to other positions. To address this issue, we propose a Robust Attention Mechanism (RAM) that improves the robustness of semantic segmentation models and notably reduces their vulnerability to patch-based attacks. Compared to the vanilla attention mechanism, RAM introduces two novel modules, Max Attention Suppression and Random Attention Dropout, both of which refine the attention matrix and limit the influence of a single adversarial patch on the segmentation results at other positions. Extensive experiments demonstrate that RAM improves the robustness of semantic segmentation models against various patch-based attack methods under different attack settings.
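The abstract does not define the two modules, so the following is only a minimal NumPy sketch of what refining an attention matrix along these lines could look like. The function names, the clip-and-renormalize rule, the `cap` and `drop_prob` parameters, and the toy scores are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax over the key axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def suppress_max_attention(attn, cap):
    """Hypothetical suppression step (an assumption, not the paper's rule).

    Clips every attention weight to `cap`, then renormalizes each row so it
    sums to one again. This shrinks the share of a dominant key; after
    renormalization the largest weight may still exceed `cap`.
    """
    clipped = np.minimum(attn, cap)
    return clipped / clipped.sum(axis=-1, keepdims=True)


def random_attention_dropout(attn, drop_prob, rng):
    """Hypothetical dropout step (an assumption, not the paper's rule).

    Zeroes each query-key weight independently with probability `drop_prob`
    and renormalizes the surviving weights in each row. A row whose keys
    were all dropped falls back to its original weights.
    """
    keep = rng.random(attn.shape) >= drop_prob
    dropped = attn * keep
    row_sum = dropped.sum(axis=-1, keepdims=True)
    safe_sum = np.where(row_sum > 0, row_sum, 1.0)
    return np.where(row_sum > 0, dropped / safe_sum, attn)


# Toy example: 4 queries, 6 keys, where key 2 stands in for an adversarial
# patch that attracts most of the attention from every query.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 6))
scores[:, 2] += 4.0

attn = softmax(scores)
refined = suppress_max_attention(attn, cap=0.3)
refined = random_attention_dropout(refined, drop_prob=0.2, rng=rng)

print("patch share before:", attn[:, 2].round(3))
print("patch share after: ", refined[:, 2].round(3))
```

In this sketch, clipping reduces how much any single key can contribute to a query's output, while the random dropout makes it less reliable for one fixed key to reach every query; both choices are illustrative stand-ins for the idea of limiting a single patch's influence on other positions.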
