Modeling the visual world: reconstruction and neural episodic representation

Modeling the visual world is a cornerstone of solving many important problems in engineering and science, including augmented/virtual reality (AR/VR), artificial intelligence (AI) and robotics. The goal of modeling is usually to distill the perceived world into a form that is suitable for solving a range of tasks. Over years of research, multiple approaches to modeling the world have been proposed. Currently, the three most widely used approaches are: explicitly reconstructing the world from images, keeping the knowledge implicitly in the weights of a neural network, and representing the world as a collection of neural embedding instances. The first approach relies on human-designed geometric priors about the process by which the images were obtained. Those priors provide geometric constraints that make it possible to determine a 3D model consistent with the observed evidence. The second and third approaches are based on learning, that is, fitting the weights of a parametric function by solving a human-designed optimization problem. The second approach keeps no explicit memory of individual instances of experience: all the information is distilled into the weights. The third approach preserves instances of interaction with the world as individual embeddings. An embedding neural network extracts those instances from experience, and other neural networks operate on them, for example, to determine similarities between pairs of experiences. In this thesis, we analyze the different modeling approaches and make three novel contributions to them. First, we propose a very accurate reconstruction method called Ray Po-
