Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks

One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks? We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than scratch and able to generalize in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, we put forth a simple and efficient perception module based on the results of our study, which can be adopted as a rather generic perception module for active frameworks.

[1]  Michael I. Jordan,et al.  Forward Models: Supervised Learning with a Distal Teacher , 1992, Cogn. Sci..

[2]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[3]  Lorien Y. Pratt,et al.  Discriminability-Based Transfer between Neural Networks , 1992, NIPS.

[4]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[5]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[6]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[7]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[8]  Luc Van Gool,et al.  SURF: Speeded Up Robust Features , 2006, ECCV.

[9]  Raymond J. Mooney,et al.  Mapping and Revising Markov Logic Networks for Transfer Learning , 2007, AAAI.

[10]  Daniel L. Silver,et al.  Guest editor’s introduction: special issue on inductive transfer learning , 2008, Machine Learning.

[11]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[12]  Alexandre B. Tsybakov,et al.  Introduction to Nonparametric Estimation , 2008, Springer series in statistics.

[13]  Peter Stone,et al.  Transfer Learning for Reinforcement Learning Domains: A Survey , 2009, J. Mach. Learn. Res..

[14]  Koby Crammer,et al.  A theory of learning from different domains , 2010, Machine Learning.

[15]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Yu Zhong,et al.  Intrinsic shape signatures: A shape descriptor for 3D object recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[17]  Yoshua Bengio,et al.  Why Does Unsupervised Pre-training Help Deep Learning? , 2010, AISTATS.

[18]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[19]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[20]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[21]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[22]  Rich Caruana,et al.  Inductive Transfer for Bayesian Network Structure Learning , 2007, ICML Unsupervised and Transfer Learning.

[23]  Shipra Agrawal,et al.  Analysis of Thompson Sampling for the Multi-armed Bandit Problem , 2011, COLT.

[24]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[26]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[27]  Trevor Darrell,et al.  Deep Domain Confusion: Maximizing for Domain Invariance , 2014, CVPR 2014.

[28]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[29]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[30]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[31]  Joan Bruna,et al.  Intriguing properties of neural networks , 2013, ICLR.

[32]  Abhinav Gupta,et al.  Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Michael I. Jordan,et al.  Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[34]  J. Peirce Understanding mid-level representations in visual processing. , 2015, Journal of vision.

[35]  Trevor Darrell,et al.  Simultaneous Deep Transfer Across Domains and Tasks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[37]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Kate Saenko,et al.  Deep CORAL: Correlation Alignment for Deep Domain Adaptation , 2016, ECCV Workshops.

[39]  Oussama Khatib,et al.  Springer Handbook of Robotics , 2007, Springer Handbooks.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jitendra Malik,et al.  Generic 3D Representation via Pose Estimation and Matching , 2016, ECCV.

[42]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[43]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[44]  Razvan Pascanu,et al.  Progressive Neural Networks , 2016, ArXiv.

[45]  Jürgen Schmidhuber,et al.  A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots , 2016, IEEE Robotics and Automation Letters.

[46]  Joshua B. Tenenbaum,et al.  Building machines that learn and think like people , 2016, Behavioral and Brain Sciences.

[47]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[49]  Jitendra Malik,et al.  Learning to Poke by Poking: Experiential Learning of Intuitive Physics , 2016, NIPS.

[50]  Robert Babuska,et al.  Learning state representation for deep actor-critic control , 2016, 2016 IEEE 55th Conference on Decision and Control (CDC).

[51]  Wojciech Jaskowski,et al.  ViZDoom: A Doom-based AI research platform for visual reinforcement learning , 2016, 2016 IEEE Conference on Computational Intelligence and Games (CIG).

[52]  Augustus Odena,et al.  Semi-Supervised Learning with Generative Adversarial Networks , 2016, ArXiv.

[53]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[54]  Sergey Levine,et al.  Deep spatial autoencoders for visuomotor learning , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[55]  Sergey Levine,et al.  High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.

[56]  Pieter Abbeel,et al.  Meta-Learning with Temporal Convolutions , 2017, ArXiv.

[57]  Trevor Darrell,et al.  Adversarial Feature Learning , 2016, ICLR.

[58]  Alexei A. Efros,et al.  Curiosity-Driven Exploration by Self-Supervised Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[59]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Ali Farhadi,et al.  Visual Semantic Planning Using Deep Successor Representations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[61]  Vladlen Koltun,et al.  Learning to Act by Predicting the Future , 2016, ICLR.

[62]  Geoffrey E. Hinton,et al.  Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , 2017, ICLR.

[63]  Paolo Favaro,et al.  Representation Learning by Learning to Count , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[64]  Nando de Freitas,et al.  Sample Efficient Actor-Critic with Experience Replay , 2016, ICLR.

[65]  Greg Turk,et al.  Preparing for the Unknown: Learning a Universal Policy with Online System Identification , 2017, Robotics: Science and Systems.

[66]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Justin Fu,et al.  EX2: Exploration with Exemplar Models for Deep Reinforcement Learning , 2017, NIPS.

[68]  Balaraman Ravindran,et al.  EPOpt: Learning Robust Neural Network Policies Using Model Ensembles , 2016, ICLR.

[69]  Christopher Burgess,et al.  DARLA: Improving Zero-Shot Transfer in Reinforcement Learning , 2017, ICML.

[70]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[71]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[72]  Fei-Fei Li,et al.  Label Efficient Learning of Transferable Representations acrosss Domains and Tasks , 2017, NIPS.

[73]  Trevor Darrell,et al.  Loss is its own Reward: Self-Supervision for Reinforcement Learning , 2016, ICLR.

[74]  Alec Radford,et al.  Proximal Policy Optimization Algorithms , 2017, ArXiv.

[75]  Sergey Levine,et al.  One-Shot Hierarchical Imitation Learning of Compound Visuomotor Tasks , 2018, ArXiv.

[76]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[77]  Sergey Levine,et al.  Deep Object-Centric Representations for Generalizable Robot Learning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[78]  Leonidas J. Guibas,et al.  Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[79]  Sergey Levine,et al.  Probabilistic Model-Agnostic Meta-Learning , 2018, NeurIPS.

[80]  Thomas L. Griffiths,et al.  Recasting Gradient-Based Meta-Learning as Hierarchical Bayes , 2018, ICLR.

[81]  Jitendra Malik,et al.  On Evaluation of Embodied Navigation Agents , 2018, ArXiv.

[82]  Yoshua Bengio,et al.  Bayesian Model-Agnostic Meta-Learning , 2018, NeurIPS.

[83]  Joshua Achiam,et al.  On First-Order Meta-Learning Algorithms , 2018, ArXiv.

[84]  David Filliat,et al.  S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning , 2018, ArXiv.

[85]  Rouhollah Rahmatizadeh,et al.  Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-to-End Learning from Demonstration , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[86]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[87]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[88]  Jitendra Malik,et al.  Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[89]  Allan Jabri,et al.  Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control , 2018, ICML.

[90]  Demis Hassabis,et al.  A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play , 2018, Science.

[91]  Bolei Zhou,et al.  Places: A 10 Million Image Database for Scene Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[92]  Jürgen Schmidhuber,et al.  World Models , 2018, ArXiv.

[93]  Jana Kosecka,et al.  Visual Representations for Semantic Target Driven Navigation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[94]  Bo Wang,et al.  Moment Matching for Multi-Source Domain Adaptation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[95]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[96]  Sergey Levine,et al.  Generalization through Simulation: Integrating Simulated and Real Data into Deep Reinforcement Learning for Vision-Based Autonomous Flight , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[97]  A. Gupta,et al.  Learning Exploration Policies for Navigation , 2019, International Conference on Learning Representations.

[98]  Rahul Sukthankar,et al.  Cognitive Mapping and Planning for Visual Navigation , 2017, International Journal of Computer Vision.

[99]  Ali Farhadi,et al.  Visual Semantic Navigation using Scene Priors , 2018, ICLR.