A Self-supervised Learning System for Object Detection in Videos Using Random Walks on Graphs

This paper presents a self-supervised system for learning to detect previously unseen categories of objects in images. The system takes as input several unlabeled videos of scenes containing various objects. The video frames are segmented into objects using depth information, and the segments are tracked through each video. The system then constructs a weighted graph that connects the tracked sequences according to the similarity of the objects they contain. The similarity between two sequences is measured with generic visual features, after automatically re-arranging the frames of the two sequences to align the viewpoints of the objects. Random walks on this graph are used to sample triplets of similar and dissimilar examples. Finally, the triplets are used to train a Siamese neural network that projects the generic visual features onto a low-dimensional manifold. Experiments on three public datasets, YCB-Video, CORe50, and RGBD-Object, show that the projected low-dimensional features improve the accuracy of clustering unknown objects into novel categories and outperform several recent unsupervised clustering techniques.
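To make the triplet-sampling step concrete, the following minimal sketch illustrates one way random walks on a weighted similarity graph could yield (anchor, positive, negative) triplets. The abstract does not specify the sampling rule, so everything here is an assumption made for illustration: the toy graph and its node names, the walk length, the choice of the most-visited walk endpoint as the positive and a never-visited node as the negative, and the standard margin-based triplet loss on squared Euclidean distances.

```python
import random

# Hypothetical toy graph: nodes are tracked object sequences, and edge
# weights are pairwise similarities (illustrative values only).
GRAPH = {
    "mug_a": {"mug_b": 0.9, "mug_c": 0.9},
    "mug_b": {"mug_a": 0.9, "mug_c": 0.9},
    "mug_c": {"mug_a": 0.9, "mug_b": 0.9},
    "box_a": {"box_b": 0.8},
    "box_b": {"box_a": 0.8},
}

def random_walk(weights, start, steps, rng):
    """Walk up to `steps` edges from `start`, choosing each neighbor with
    probability proportional to its edge weight; return the end node."""
    node = start
    for _ in range(steps):
        neighbors = [n for n, w in weights[node].items() if w > 0]
        if not neighbors:
            break
        probs = [weights[node][n] for n in neighbors]
        node = rng.choices(neighbors, weights=probs, k=1)[0]
    return node

def sample_triplet(weights, anchor, steps, walks, rng):
    """Assumed rule: the positive is the most frequently reached endpoint,
    the negative is drawn from nodes that no walk ended on."""
    visits = {}
    for _ in range(walks):
        end = random_walk(weights, anchor, steps, rng)
        if end != anchor:
            visits[end] = visits.get(end, 0) + 1
    unreached = [n for n in weights if n != anchor and n not in visits]
    if not visits or not unreached:
        return None  # no usable positive or negative for this anchor
    positive = max(visits, key=visits.get)
    negative = rng.choice(unreached)
    return anchor, positive, negative

def triplet_loss(a, p, n, margin=1.0):
    """Standard triplet margin loss on squared Euclidean distances."""
    d_ap = sum((x - y) ** 2 for x, y in zip(a, p))
    d_an = sum((x - y) ** 2 for x, y in zip(a, n))
    return max(0.0, d_ap - d_an + margin)

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_triplet(GRAPH, "mug_a", steps=2, walks=50, rng=rng))
```

In this toy graph the two groups share no edges, so a walk started at a mug sequence can only end on another mug sequence, and the negative is always a box sequence. In a real similarity graph the groups would be weakly connected, and the walk length would control how far a sampled positive may drift from the anchor.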
