Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Grounded Situation Recognition (GSR) recognizes and interprets visual scenes in a contextually intuitive way, yielding the salient activity (verb) and the involved entities (roles) depicted in an image. In this work, we focus on applying GSR to assist people with visual impairments (PVI). To navigate their surroundings confidently and make informed decisions, PVI often need precise localization of the detected objects, which the bounding boxes produced by GSR provide only coarsely. We therefore propose, for the first time, an Open Scene Understanding (OpenSU) system that generates pixel-wise dense segmentation masks of the involved entities instead of bounding boxes. Specifically, we build OpenSU on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance feature extraction and the interaction between encoder and decoder, we construct OpenSU with a pure transformer backbone, which improves GSR performance. To accelerate convergence and thereby reduce training time, we replace all activation functions within the GSR decoders with GELU. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, field testing on dedicated assistive technology datasets and application demonstrations indicate that the proposed OpenSU system can enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.
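As a rough illustration of the activation swap mentioned above, the following sketch compares GELU with ReLU on scalar inputs using only the Python standard library. It is not taken from the OpenSU code: the function names are hypothetical, and the choice of ReLU as the activation being replaced is an assumption, since the abstract does not name the original activation.

```python
import math


def relu(x: float) -> float:
    """ReLU, assumed here (not stated in the abstract) to be the replaced activation."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Print the three activations side by side on a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

Unlike ReLU, GELU is smooth around zero and passes small negative values for negative inputs. Whether this accounts for the faster convergence reported above is not something this sketch can show.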
