Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V for the recently popular visual Anomaly Detection (AD) task and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image- and pixel-level evaluation, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation; we also make several alternative design attempts for comparative analysis. The results show that GPT-4V can achieve reasonable results on the zero-shot AD task through the VQA paradigm, e.g., image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROC on the MVTec AD and VisA datasets, respectively. However, its performance still lags behind state-of-the-art zero-shot methods such as WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task, and we also outline several possible directions for future work. Code is available at \url{https://github.com/zhangzjn/GPT-4V-AD}.
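The three components above can be pictured as a simple VQA loop around an LMM. The sketch below is only illustrative and is not the authors' implementation: it assumes SLIC superpixels (cf. [20]) as one possible granular region division, uses a hypothetical `query_gpt4v` placeholder in place of a real GPT-4V(ision) API call, and assumes the model is prompted to reply with the indices of anomalous regions so the answer can be mapped back to a pixel mask for AU-ROC evaluation.

```python
# Illustrative GPT-4V-AD-style pipeline sketch (not the official code).
import re
import numpy as np
from skimage.segmentation import slic


def granular_region_division(image: np.ndarray, n_segments: int = 50) -> np.ndarray:
    """1) Granular Region Division: split the image into coarse regions (HxW label map)."""
    return slic(image, n_segments=n_segments, compactness=10, start_label=0)


def build_prompt(n_regions: int, category: str) -> str:
    """2) Prompt Designing: ask GPT-4V to name the anomalous region indices."""
    return (
        f"The image shows a {category} divided into {n_regions} numbered regions. "
        "Is the object anomalous? If so, list the anomalous region indices, "
        "e.g. 'anomalous regions: 3, 7'. Otherwise answer 'no anomaly'."
    )


def query_gpt4v(image: np.ndarray, prompt: str) -> str:
    """Hypothetical placeholder for an actual GPT-4V(ision) VQA call."""
    raise NotImplementedError("Wire this up to a real LMM endpoint.")


def text2segmentation(answer: str, labels: np.ndarray) -> np.ndarray:
    """3) Text2Segmentation: turn the textual answer into a binary anomaly mask."""
    mask = np.zeros(labels.shape, dtype=np.uint8)
    if "no anomaly" in answer.lower():
        return mask
    for idx in re.findall(r"\d+", answer):
        mask[labels == int(idx)] = 1
    return mask
```

The image-level score can then be taken, for example, as the maximum (or mean) of the pixel-level mask, so that both image- and pixel-level AU-ROC can be computed with standard tooling; the actual prompt wording and region-marking scheme in the paper may differ from this sketch.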

[1] Jiangning Zhang et al. CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection, 2023, arXiv.

[2] Yanyan Zhao et al. An Early Evaluation of GPT-4V(ision), 2023, arXiv.

[3] Dezhi Peng et al. Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation, 2023, arXiv.

[4] Kevin Lin et al. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), 2023, arXiv.

[5] Zhaopeng Gu et al. AnomalyGPT: Detecting Industrial Anomalies using Large Vision-Language Models, 2023, arXiv.

[6] Jiangning Zhang et al. A Zero-/Few-Shot Anomaly Classification and Segmentation Method for CVPR 2023 VAND Workshop Challenge Tracks 1&2: 1st Place on Zero-shot AD and 4th Place on Few-shot AD, 2023, arXiv.

[7] Y. Cheng et al. Segment Any Anomaly without Training via Hybrid Prompt Regularization, 2023, arXiv.

[8] Ross B. Girshick et al. Segment Anything, 2023, IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Zilei Wang et al. SimpleNet: A Simple Network for Image Anomaly Detection and Localization, 2023, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Avinash Ravichandran et al. WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation, 2023, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] O. Dabeer et al. SPot-the-Difference Self-Supervised Pre-training for Anomaly Detection and Segmentation, 2022, ECCV.

[12] Xin Lu et al. A Unified Model for Multi-class Anomaly Detection, 2022, NeurIPS.

[13] Yong Liu et al. Omni-Frequency Channel-Selection Representations for Unsupervised Anomaly Detection, 2022, IEEE Transactions on Image Processing.

[14] Xingyu Li et al. Anomaly Detection via Reverse Distillation from One-Class Embedding, 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] D. Skočaj et al. DRÆM – A Discriminatively Trained Reconstruction Embedding for Surface Anomaly Detection, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).

[16] B. Schölkopf et al. Towards Total Recall in Industrial Anomaly Detection, 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Ilya Sutskever et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[18] Paul Bergmann et al. Uninformed Students: Student-Teacher Anomaly Detection With Discriminative Latent Embeddings, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Carsten Steger et al. MVTec AD — A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Pascal Fua et al. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Jianwei Yang et al. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, 2023.