Robust Multiview Feature Learning for RGB-D Image Understanding

The availability of massive RGB-depth (RGB-D) images poses a compelling need for effective RGB-D content understanding techniques. RGB-D images provide synchronized information from multiple views (e.g., color and depth) of real-world objects and scenes. This work proposes learning compact and discriminative features from the multiple views of RGB-D content toward effective feature representation for RGB-D image understanding. In particular, a robust multiview feature learning approach is developed, which exploits the intrinsic relations among multiple views. The feature learning in multiple views is jointly optimized in an integrated formulation. The joint optimization essentially exploits the intrinsic relations among the views, leading to effective features and making the learning process robust to noises. The feature learning function is formulated as a robust nonnegative graph embedding function over multiple graphs in various views. The graphs characterize the local geometric and discriminating structure of the multiview data. The joint sparsity in ℓ1-norm graph embedding and ℓ21-norm data factorization further enhances the robustness of feature learning. We derive an efficient computational solution for the proposed approach and provide rigorous theoretical proof with regard to its convergence. We apply the proposed approach to two RGB-D image understanding tasks: RGB-D object classification and RGB-D scene categorization. We conduct extensive experiments on two real-world RGB-D image datasets. The experimental results have demonstrated the effectiveness of the proposed approach.

[1]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[2]  Meng Wang,et al.  Robust Non-negative Graph Embedding: Towards noisy data, unreliable graphs, and noisy labels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[4]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time dynamic 3D surface reconstruction and interaction , 2011, SIGGRAPH '11.

[5]  Hai Jin,et al.  Projective Nonnegative Graph Embedding , 2010, IEEE Transactions on Image Processing.

[6]  Luis Salgado,et al.  Depth-Color Fusion Strategy for 3-D Scene Modeling With Kinect , 2013, IEEE Transactions on Cybernetics.

[7]  Tao Mei,et al.  Joint multi-label multi-instance learning for image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Jonathan T. Barron,et al.  A category-level 3-D object dataset: Putting the Kinect to work , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[9]  Hung-Khoon Tan,et al.  Beyond search: Event-driven summarization for web videos , 2011, TOMCCAP.

[10]  Yong Luo,et al.  Manifold Regularized Multitask Learning for Semi-Supervised Multilabel Image Classification , 2013, IEEE Transactions on Image Processing.

[11]  Tao Mei,et al.  Graph-based semi-supervised learning with multiple labels , 2009, J. Vis. Commun. Image Represent..

[12]  Dieter Fox,et al.  Unsupervised Feature Learning for RGB-D Based Object Recognition , 2012, ISER.

[13]  Nicu Sebe,et al.  Discriminating Joint Feature Analysis for Multimedia Data Understanding , 2012, IEEE Transactions on Multimedia.

[14]  Dieter Fox,et al.  RGB-(D) scene labeling: Features and algorithms , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Hai Yang,et al.  ACM Transactions on Intelligent Systems and Technology - Special Section on Urban Computing , 2014 .

[16]  Ling Shao,et al.  Feature Learning for Image Classification Via Multiobjective Genetic Programming , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[17]  James M. Keller,et al.  Histogram of Oriented Normal Vectors for Object Recognition with a Depth Sensor , 2012, ACCV.

[18]  Juho Kannala,et al.  Joint Depth and Color Camera Calibration with Distortion Correction , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[20]  Bingbing Ni,et al.  RGBD-HuDaAct: A color-depth video database for human daily activity recognition , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[21]  Antonis A. Argyros,et al.  Tracking the articulated motion of two strongly interacting hands , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Meng Wang,et al.  Multimedia Question Answering , 2010, IEEE MultiMedia.

[23]  Stephen Lin,et al.  Graph Embedding and Extensions: A General Framework for Dimensionality Reduction , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Gang Hua,et al.  Generating Descriptive Visual Words and Visual Phrases for Large-Scale Image Applications , 2011, IEEE Transactions on Image Processing.

[25]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[26]  Zhengyou Zhang,et al.  Microsoft Kinect Sensor and Its Effect , 2012, IEEE Multim..

[27]  Nathan Silberman,et al.  Indoor scene segmentation using a structured light sensor , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[28]  Jonathan T. Barron,et al.  A category-level 3-D object dataset: Putting the Kinect to work , 2011, ICCV Workshops.

[29]  Chris H. Q. Ding,et al.  Robust nonnegative matrix factorization using L21-norm , 2011, CIKM '11.

[30]  Ling Shao,et al.  Enhanced Computer Vision With Microsoft Kinect Sensor: A Review , 2013, IEEE Transactions on Cybernetics.

[31]  Ramesh C. Jain,et al.  Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images , 2011, TIST.