An Exploration of Embodied Visual Exploration

Embodied computer vision considers perception for robots operating in novel, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite the progress thus far, many basic questions pertinent to this problem remain unanswered: (i) What does it mean for an agent to explore its environment well? (ii) Which methods work well, and under which assumptions and environmental settings? (iii) Where do current approaches fall short, and where might future work seek to improve? Seeking answers to these questions, we first present a taxonomy of existing visual exploration algorithms and create a standard framework for benchmarking them. We then perform a thorough empirical study of four state-of-the-art paradigms using the proposed framework, two photorealistic simulated 3D environments, a state-of-the-art exploration architecture, and diverse evaluation metrics. Our experimental results offer insights and suggest new performance metrics and baselines for future work in visual exploration. Code, models, and data are publicly available: this https URL
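To make question (i) concrete, one common way to quantify exploration quality is coverage: the fraction of an environment's navigable area the agent has observed so far. The sketch below computes such a metric over a top-down occupancy grid; the mask names and the 5 cm cell resolution are illustrative assumptions for this sketch, not details of the paper's released code.

```python
# Minimal sketch of an area-coverage exploration metric over a top-down
# occupancy grid. All names and the cell resolution are assumptions made
# for illustration, not taken from the paper's implementation.
import numpy as np

CELL_AREA_M2 = 0.05 * 0.05  # assumed map resolution: 5 cm x 5 cm cells


def area_coverage(seen_mask: np.ndarray, navigable_mask: np.ndarray) -> float:
    """Fraction of navigable map cells the agent has observed.

    seen_mask:      boolean HxW grid of cells the agent has seen.
    navigable_mask: boolean HxW grid of cells that are traversable.
    """
    seen_navigable = np.logical_and(seen_mask, navigable_mask)
    return float(seen_navigable.sum()) / max(int(navigable_mask.sum()), 1)


def covered_area_m2(seen_mask: np.ndarray, navigable_mask: np.ndarray) -> float:
    """Absolute observed navigable area in square meters, under the
    assumed cell resolution above."""
    return float(np.logical_and(seen_mask, navigable_mask).sum()) * CELL_AREA_M2
```

Under this view, an agent "explores well" if coverage rises quickly with time or motion, which is why coverage-versus-steps curves are a natural way to compare exploration policies.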
