When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition

Object and scene recognition are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors for these tasks has emerged as an important area of focus for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, how to exploit deep features from a multi-layer CNN model effectively remains an open problem. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model is employed as a backbone to extract visual features at multiple levels. The second stage efficiently maps these features into high-level representations with a fully randomized structure of recursive neural networks (RNNs). To cope with the high dimensionality of CNN activations, a random weighted pooling scheme is proposed by extending the idea of randomness in RNNs. Multi-modal fusion is performed through a soft voting approach in which weights are computed from the individual recognition confidences (i.e., SVM scores) of the RGB and depth streams separately. This produces consistent class label estimates in the final RGB-D classification. Extensive experiments verify that the fully randomized structure in the RNN stage successfully encodes CNN activations into solid, discriminative features. Comparative experimental results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach significantly outperforms state-of-the-art methods in both object and scene recognition tasks.
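The second stage and the fusion step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the pooling layout, the RNN block size, the weight ranges, or the exact voting weights, so the function names (`random_weighted_pool`, `random_rnn`, `soft_vote`), the uniform weight range, the 2x2 merge blocks, and the use of the top softmax probability as a confidence measure are all assumptions made for illustration.

```python
import numpy as np


def random_weighted_pool(feats, out_channels, rng):
    """Reduce the channel count of a (C, H, W) activation map.

    Assumption: channels are split into equal groups and each group is
    collapsed by a fixed, untrained random weighted sum.
    """
    C, H, W = feats.shape
    assert C % out_channels == 0, "C must be divisible by out_channels"
    group = C // out_channels
    w = rng.uniform(-0.1, 0.1, size=(out_channels, group))
    grouped = feats.reshape(out_channels, group, H, W)
    return np.einsum("og,oghw->ohw", w, grouped)


def random_rnn(feats, rng, block=2):
    """Encode a (C, H, W) map into a C-dimensional vector.

    A single random weight matrix (never trained) merges each
    block x block neighborhood into one parent vector, and the same
    matrix is reused at every level until a 1x1 map remains.
    Assumption: H == W and both are powers of `block`.
    """
    C = feats.shape[0]
    Wt = rng.uniform(-0.1, 0.1, size=(C, C * block * block))
    x = feats
    while x.shape[1] > 1:
        _, H, W = x.shape
        h, w = H // block, W // block
        # Gather the block x block children of every parent position.
        x = x.reshape(C, h, block, w, block).transpose(0, 2, 4, 1, 3)
        x = x.reshape(C * block * block, h * w)
        x = np.tanh(Wt @ x).reshape(C, h, w)
    return x.reshape(C)


def _softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def soft_vote(rgb_scores, depth_scores):
    """Fuse per-class SVM scores of shape (n_samples, n_classes).

    Assumption: each stream's confidence is its top softmax
    probability, and the two confidences are normalized into weights.
    """
    p_rgb, p_depth = _softmax(rgb_scores), _softmax(depth_scores)
    c_rgb = p_rgb.max(axis=1, keepdims=True)
    c_depth = p_depth.max(axis=1, keepdims=True)
    w_rgb = c_rgb / (c_rgb + c_depth)
    fused = w_rgb * p_rgb + (1.0 - w_rgb) * p_depth
    return fused.argmax(axis=1)


# Hypothetical usage on synthetic data standing in for CNN activations.
rng = np.random.default_rng(0)
activations = rng.standard_normal((8, 4, 4))
pooled = random_weighted_pool(activations, out_channels=4, rng=rng)
encoded = random_rnn(pooled, rng)
```

Because neither the pooling weights nor the RNN weights are trained, the only learned components in such a pipeline would be the pretrained CNN backbone and the per-modality classifiers whose scores feed `soft_vote`.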
