Image Enhancement and Translation for RGB-D Indoor Scene Recognition

Most existing methods for RGB-D indoor scene recognition adopt backbone networks designed for image recognition. These backbones emphasize global features and largely ignore local ones, which limits accuracy in practice. This paper proposes a T-like network called T-Net that exploits both global and local features through multi-scale supervision. Specifically, we add an image translation branch and use pixel-level semantic segmentation annotations together with image-level labels to jointly supervise the model, encouraging it to attend to more object regions. In addition, low-quality source images make it difficult to extract representative features. To address this issue, we apply Multi-Scale Retinex with Color Restoration (MSRCR) to enhance the brightness and contrast of the RGB images. Experiments show that the proposed method outperforms state-of-the-art methods on the SUN RGB-D and NYU Depth v2 datasets.
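
To make the enhancement step concrete, the following is a minimal sketch of MSRCR in its commonly described form, not the authors' implementation. It averages single-scale Retinex outputs (log of the image minus log of its Gaussian-blurred version) over several scales and multiplies the result by a color restoration term. The function names `gaussian_blur` and `msrcr`, the scale values, and the constants `alpha` and `beta` are illustrative assumptions; the abstract does not state the settings used. The usual gain and offset step is replaced here by a simple per-channel min-max stretch, which is also an assumption.

```python
import numpy as np


def gaussian_blur(channel, sigma):
    """Separable Gaussian blur of a 2-D array using edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()

    def blur_axis(arr, axis):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (radius, radius)
        padded = np.pad(arr, pad, mode="edge")
        # 'valid' convolution on the padded signal restores the original length.
        return np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )

    return blur_axis(blur_axis(channel, 0), 1)


def msrcr(image, sigmas=(15.0, 80.0, 250.0), alpha=125.0, beta=46.0):
    """Hypothetical MSRCR sketch for an H x W x 3 uint8 RGB image.

    sigmas, alpha and beta are illustrative values, not the paper's settings.
    """
    img = image.astype(np.float64) + 1.0  # shift by 1 to avoid log(0)
    log_img = np.log(img)

    # Multi-scale Retinex: equally weighted average over the scales.
    msr = np.zeros_like(img)
    for sigma in sigmas:
        for c in range(img.shape[2]):
            blurred = gaussian_blur(img[:, :, c], sigma)
            msr[:, :, c] += (log_img[:, :, c] - np.log(blurred)) / len(sigmas)

    # Color restoration: compares each channel with the sum over channels.
    color = beta * (np.log(alpha * img) - np.log(img.sum(axis=2, keepdims=True)))
    out = color * msr

    # Assumption: per-channel min-max stretch to [0, 255] instead of gain/offset.
    lo = out.min(axis=(0, 1), keepdims=True)
    hi = out.max(axis=(0, 1), keepdims=True)
    out = (out - lo) / np.maximum(hi - lo, 1e-12) * 255.0
    return out.astype(np.uint8)
```

In a pipeline like the one described, such a function would be applied to each RGB image before it is fed to the network; the depth channel is left untouched in this sketch.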
