Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance

Transparent objects, such as glass walls and doors, constitute architectural obstacles hindering the mobility of people with low vision or blindness. For instance, the open space behind glass doors is inaccessible unless it is correctly perceived and interacted with. However, traditional assistive technologies rarely cover the segmentation of these safety-critical transparent objects. In this paper, we build a wearable system with a novel dual-head Transformer for Transparency (Trans4Trans) perception model, which can segment general and transparent objects. The two dense segmentation results are further combined with depth information in the system to help users navigate safely and negotiate transparent obstacles. We propose a lightweight Transformer Parsing Module (TPM) to perform multi-scale feature interpretation in the transformer-based decoder. Benefiting from TPM, the two decoders can learn jointly from their corresponding datasets to pursue robustness while maintaining efficiency on a portable GPU, with a negligible increase in computation. The entire Trans4Trans model is constructed in a symmetrical encoder-decoder architecture, which outperforms state-of-the-art methods on the test sets of the Stanford2D3D and Trans10K-v2 datasets, obtaining mIoU of 45.13% and 75.14%, respectively. Through a user study and various pre-tests conducted in indoor and outdoor scenes, the usability and reliability of our assistive system have been extensively verified. Meanwhile, the Trans4Trans model also achieves outstanding performance on driving scene datasets. On the Cityscapes, ACDC, and DADA-seg datasets, corresponding to common environments, adverse weather, and traffic accident scenarios, mIoU scores of 81.5%, 76.3%, and 39.2% are obtained, demonstrating its high efficiency and robustness for real-world transportation applications.

This work was supported in part through the AccessibleMaps project by the Federal Ministry of Labor and Social Affairs (BMAS) under Grant No. 01KM151112, in part by the University of Excellence through the “KIT Future Fields” project, and in part by Hangzhou SurImage Company Ltd. (Corresponding author: Kailun Yang.)

¹Authors are with the Computer Vision for Human-Computer Interaction Lab, and ²authors are with the Center for Digital Accessibility and Assistive Technology, Karlsruhe Institute of Technology, Germany (e-mail: jiaming.zhang@kit.edu, kailun.yang@kit.edu, angela.constantinescu@kit.edu, kunyu.peng@kit.edu, karin.mueller2@kit.edu, rainer.stiefelhagen@kit.edu). Code will be made publicly available at: https://github.com/jamycheung/Trans4Trans.
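The mIoU figures quoted in the abstract are mean intersection-over-union scores: for each class, the number of pixels correctly assigned to that class is divided by the number of pixels that either the prediction or the ground truth assigns to it, and the per-class ratios are then averaged. The following is a minimal sketch of that computation, not the authors' evaluation code. The function names, the `ignore_index` value of 255, and the toy label maps are assumptions made for illustration.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground truth, columns index the prediction.
    Pixels whose ground-truth label equals ignore_index are skipped
    (255 is an assumed convention, not taken from the paper).
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)
    valid = target != ignore_index
    # Encode each (target, pred) pair as a single integer and count occurrences.
    pair_index = target[valid] * num_classes + pred[valid]
    counts = np.bincount(pair_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Average the per-class IoU over classes that occur at least once."""
    true_positive = np.diag(conf).astype(np.float64)
    # union = predicted-as-class + labeled-as-class - counted-twice overlap
    union = conf.sum(axis=0) + conf.sum(axis=1) - true_positive
    present = union > 0
    return float((true_positive[present] / union[present]).mean())


# Toy 2x3 label maps with three classes (hypothetical data).
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])

conf = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(conf)
print(f"mIoU = {miou:.2%}")  # per-class IoU is 1/3, 2/3, 1/2, so this prints 50.00%
```

In benchmark evaluation the confusion matrix is commonly accumulated over every image in the test set before the ratio is taken, so that large and small images contribute in proportion to their pixel counts. Whether the reported scores were computed that way is not stated here.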
