Feature learning for RGB-D scene understanding

Scene understanding is a fundamental problem in computer vision and is critical to applications in robotics and augmented reality. It comprises many tasks, including scene labeling, object recognition and scene classification. Most previous scene understanding methods focus on outdoor scenes. Indoor scene understanding is more challenging because of poor illumination and cluttered objects. With the wide availability of affordable RGB-D cameras such as the Kinect, indoor scene analysis has changed substantially, thanks to the rich 3D geometry information provided by depth measurements.

Feature extraction is a key part of scene understanding tasks. Most early methods extract hand-crafted features. However, the performance of such feature extractors depends heavily on how the features are designed and combined. The design process requires an empirical understanding of the data and is therefore hard to extend systematically to different modalities. In addition, hand-crafted features usually capture only a subset of the recognition cues in the raw data and may ignore useful information. Thus, in this research, we focus on feature learning with raw data as input. In particular, we explore feature learning on three different tasks of indoor scene understanding using RGB-D input:

• Scene labeling: The aim is to densely assign a category label (e.g. table, TV) to each pixel in an image. Inspired by the success of unsupervised feature learning, we start by adapting an existing unsupervised feature learning technique to learn features directly from RGB-D images. Typically, better performance can be achieved by further applying feature encoding over the learned features to build
