Efforts towards robust visual scene understanding tend to rely heavily on manual annotation. When human labels are required, collecting a dataset large enough to train a successful robot vision system is almost certain to be prohibitively expensive. However, we argue that a robot equipped with a vision sensor can learn powerful visual representations in a self-directed manner by relying on fundamental physical priors and bootstrapping techniques. For example, basic visual tracking systems have been shown to automatically label short-range correspondences in video, which can be used to train a system with capabilities analogous to object permanence in humans. Such an object permanence system can in turn automatically label long-range correspondences, enabling the training of a system able to compare and contrast objects and scenes. In the end, the agent develops a representation that encodes the persistent material properties, state, lighting, and other attributes of the various parts of a visual scene. Starting from such a strong visual representation, the agent can then learn to solve traditional vision tasks such as class and/or instance recognition using only a sparse set of labels, found on the Internet or solicited at little cost from humans. More importantly, such a representation would also enable truly robust solutions to core challenges in robotics such as global localization, loop closure detection, and object pose estimation.
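To make the first bootstrapping step concrete, the sketch below illustrates one common way tracker-derived correspondences can supervise representation learning: patches linked by a short-range track serve as positive pairs, random patches as negatives, under a triplet loss. This is a minimal illustration under stated assumptions, not the system described above; the PatchEncoder architecture, tensor shapes, and the choice of PyTorch's triplet margin loss are all illustrative.

```python
# Minimal sketch: learn patch descriptors from tracker-labeled video
# correspondences. Anchor/positive come from the two ends of a tracked
# trajectory (labels obtained "for free"); negatives are random patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Tiny CNN that embeds an image patch into a unit-norm descriptor."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        # L2-normalize so distances in the triplet loss are well scaled.
        return F.normalize(self.net(x), dim=1)

encoder = PatchEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.5)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Placeholder batch: in practice `anchor` and `positive` would be the first
# and last patches of a tracked trajectory, `negative` a patch from elsewhere.
anchor, positive, negative = (torch.randn(16, 3, 64, 64) for _ in range(3))

opt.zero_grad()
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
opt.step()
```

The same training loop could later consume long-range correspondences produced by the object permanence stage, which is what makes the bootstrapping recursive: each learned capability supplies the automatic labels for the next.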