Neural Multisensory Scene Inference

For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention than the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes that are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve computational efficiency and robustness to combinations of modalities unseen at test time. Experimental results demonstrate that the proposed model can efficiently infer robust, modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To support this investigation, we have also developed a novel multisensory simulation environment for embodied agents.
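The Product-of-Experts idea underlying the fusion step (following Hinton, 2002) can be sketched for diagonal-Gaussian experts: the product of Gaussians is itself Gaussian, with precision equal to the sum of the experts' precisions and mean equal to the precision-weighted average of their means. The function below is an illustrative sketch of that combination rule, not the paper's actual GMN or Amortized Product-of-Experts implementation; the name and the diagonal-Gaussian assumption are ours.

```python
import numpy as np

def gaussian_poe(mus, logvars):
    """Fuse per-modality Gaussian experts via a product of experts.

    mus, logvars: arrays of shape (n_experts, latent_dim) holding each
    expert's mean and log-variance. The product of the Gaussians
    N(mu_i, var_i) is Gaussian with precision sum_i(1/var_i) and mean
    equal to the precision-weighted average of the mu_i.
    """
    mus = np.asarray(mus, dtype=float)
    precisions = np.exp(-np.asarray(logvars, dtype=float))  # 1 / var_i
    combined_precision = precisions.sum(axis=0)
    combined_var = 1.0 / combined_precision
    combined_mu = combined_var * (precisions * mus).sum(axis=0)
    return combined_mu, combined_var
```

Because missing modalities simply contribute no factor to the product, this rule lets the same latent be inferred from any subset of sensory channels, which is the property the abstract highlights.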
