Time-Contrastive Networks: Self-Supervised Learning from Video

We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a triplet loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at sermanet.github.io/imitate

[1]  Dean Pomerleau,et al.  Efficient Training of Artificial Neural Networks for Autonomous Navigation , 1991, Neural Computation.

[2]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[3]  Jun Nakanishi,et al.  Movement imitation with nonlinear dynamical systems in humanoid robots , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[4]  W. Wong,et al.  On ψ-Learning , 2003 .

[5]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[6]  G. Rizzolatti,et al.  The mirror-neuron system. , 2004, Annual review of neuroscience.

[7]  M. Brass,et al.  Imitation: is cognitive neuroscience solving the correspondence problem? , 2005, Trends in Cognitive Sciences.

[8]  Aude Billard,et al.  On Learning, Representing, and Generalizing a Task in a Humanoid Robot , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[9]  Siddhartha S. Srinivasa,et al.  Imitation learning for locomotion and manipulation , 2007, 2007 7th IEEE-RAS International Conference on Humanoid Robots.

[10]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[11]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[12]  Stefan Schaal,et al.  Learning and generalization of motor skills by learning from demonstration , 2009, 2009 IEEE International Conference on Robotics and Automation.

[13]  Brett Browning,et al.  A survey of robot learning from demonstration , 2009, Robotics Auton. Syst..

[14]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[15]  Stefan Schaal,et al.  A Generalized Path Integral Control Approach to Reinforcement Learning , 2010, J. Mach. Learn. Res..

[16]  G. Rizzolatti,et al.  View-Based Encoding of Actions in Mirror Neurons of Area F5 in Macaque Premotor Cortex , 2011, Current Biology.

[17]  Geoffrey J. Gordon,et al.  A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[18]  Siddhartha S. Srinivasa,et al.  Online customization of teleoperation interfaces , 2012, 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication.

[19]  Oliver Kroemer,et al.  Learning to select and generalize striking movements in robot table tennis , 2012, AAAI Fall Symposium: Robots Learning Interactively from Human Teachers.

[20]  Yuval Tassa,et al.  Synthesis and stabilization of complex behaviors through online trajectory optimization , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[21]  Sergey Levine,et al.  Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics , 2014, NIPS.

[22]  Cordelia Schmid,et al.  Local Convolutional Features with Unsupervised Training for Image Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Iasonas Kokkinos,et al.  Discriminative Learning of Deep Convolutional Feature Point Descriptors , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Jonathan Tompson,et al.  Unsupervised Learning of Spatiotemporally Coherent Metrics , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[27]  Sergey Levine,et al.  Learning Visual Feature Spaces for Robotic Manipulation with Deep Spatial Autoencoders , 2015, ArXiv.

[28]  Greg Mori,et al.  Pose Embeddings: A Deep Architecture for Learning to Match Human Poses , 2015, ArXiv.

[29]  Nikos Komodakis,et al.  Learning to compare image patches via convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Joshua B. Tenenbaum,et al.  Understanding Visual Concepts with Continuation Learning , 2016, ArXiv.

[33]  Jitendra Malik,et al.  Generic 3D Representation via Pose Estimation and Matching , 2016, ECCV.

[34]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[35]  Kihyuk Sohn,et al.  Improved Deep Metric Learning with Multi-class N-pair Loss Objective , 2016, NIPS.

[36]  Antonio Torralba,et al.  SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.

[37]  Silvio Savarese,et al.  Deep Metric Learning via Lifted Structured Feature Embedding , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Andrew Owens,et al.  Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Gustavo Carneiro,et al.  Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Pieter Abbeel,et al.  Third-Person Imitation Learning , 2017, ICLR.

[41]  Stefano Ermon,et al.  Label-Free Supervision of Neural Networks with Physics and Domain Knowledge , 2016, AAAI.

[42]  Trevor Darrell,et al.  Learning Features by Watching Objects Move , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Gaurav S. Sukhatme,et al.  Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning , 2017, ICML.

[44]  Sergey Levine,et al.  Path integral guided policy search , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[45]  Marcin Andrychowicz,et al.  One-Shot Imitation Learning , 2017, NIPS.

[46]  Alexei A. Efros,et al.  Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Sergey Levine,et al.  Unsupervised Perceptual Rewards for Imitation Learning , 2016, Robotics: Science and Systems.

[48]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  S. Levine,et al.  Time-Contrastive Networks : Self-Supervised Learning from Pixels , 2017 .

[50]  Aaron C. Courville,et al.  Adversarially Learned Inference , 2016, ICLR.

[51]  Sergey Levine,et al.  Imitation from Observation: Learning to Imitate Behaviors from Raw Video via Context Translation , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).