Beyond Image to Depth: Improving Depth Prediction using Echoes

We address the problem of estimating depth with multi modal audio visual data. Inspired by the ability of animals, such as bats and dolphins, to infer distance of objects with echolocation, some recent methods have utilized echoes for depth estimation. We propose an end-to-end deep learning based pipeline utilizing RGB images, binaural echoes and estimated material properties of various objects within a scene. We argue that the relation between image, echoes and depth, for different scene elements, is greatly influenced by the properties of those elements, and a method designed to leverage this information can lead to significantly improved depth estimation from audio visual inputs. We propose a novel multi modal fusion technique, which incorporates the material properties explicitly while combining audio (echoes) and visual modalities to predict the scene depth. We show empirically, with experiments on Replica dataset, that the proposed method obtains 28% improvement in RMSE compared to the state-of-the-art audio-visual depth prediction method. To demonstrate the effectiveness of our method on larger dataset, we report competitive performance on Matterport3D, proposing to use it as a multimodal depth prediction benchmark with echoes for the first time. We also analyse the proposed method with exhaustive ablation experiments and qualitative results. The code and models are available at https://krantiparida.github.io/projects/bimgdepth.html

[1]  Leonidas J. Guibas,et al.  Bidirectional Estimators for Light Transport , 1995 .

[2]  S. Shimojo,et al.  When Sound Affects Vision: Effects of Auditory Grouping on Visual Motion Perception , 2001, Psychological science.

[3]  Paul Newman,et al.  Image and Sparse Laser Fusion for Dense Scene Reconstruction , 2009, FSR.

[4]  Xiaojin Gong,et al.  Guided Depth Enhancement via Anisotropic Diffusion , 2013, PCM.

[5]  Nicolas S. Holliman,et al.  3D sound and 3D image interactions: a review of audio-visual depth perception , 2014, Electronic Imaging.

[6]  Noah Snavely,et al.  Material recognition in the wild with the Materials in Context Database , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Pavel Zahorik,et al.  Auditory distance perception in humans: a review of cues, development, neuronal bases, and effects of sensory loss , 2015, Attention, perception & psychophysics.

[9]  Lore Thaler,et al.  Echolocation in humans: an overview. , 2016, Wiley interdisciplinary reviews. Cognitive science.

[10]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[11]  Andrew Zisserman,et al.  Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  Nuno Vasconcelos,et al.  Self-Supervised Generation of Spatial Audio for 360 Video , 2018, NIPS 2018.

[13]  Andrew Owens,et al.  Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.

[14]  Rogério Schmidt Feris,et al.  Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.

[15]  Yinda Zhang,et al.  Deep Depth Completion of a Single RGB-D Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Andrew Zisserman,et al.  Objects that Sound , 2017, ECCV.

[17]  Chuang Gan,et al.  The Sound of Pixels , 2018, ECCV.

[18]  Takayuki Okatani,et al.  Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps With Accurate Object Boundaries , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[19]  Sertac Karaman,et al.  Self-Supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[20]  Tsung-Han Wu,et al.  Indoor Depth Completion with Boundary Consistency and Self-Attention , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[21]  Kris Kitani,et al.  Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[22]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Michael Goesele,et al.  The Replica Dataset: A Digital Replica of Indoor Spaces , 2019, ArXiv.

[24]  Kristen Grauman,et al.  Co-Separating Sounds of Visual Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  M. Pollefeys,et al.  DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene From Sparse LiDAR Data and Single Color Image , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Jitendra Malik,et al.  Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Chuang Gan,et al.  The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  William T. Freeman,et al.  Learning the Depths of Moving People by Watching Frozen People , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Gaurav Sharma,et al.  Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[30]  Amlaan Bhoi,et al.  Monocular Depth Estimation: A Survey , 2019, ArXiv.

[31]  K. Grauman,et al.  SoundSpaces: Audio-Visual Navigation in 3D Environments , 2019, ECCV.

[32]  Kristen Grauman,et al.  VisualEchoes: Spatial Image Representation Learning through Echolocation , 2020, ECCV.

[33]  P. Maragos,et al.  STAViS: Spatio-Temporal AudioVisual Saliency Network , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Bingbing Zhuang,et al.  Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction , 2020, ECCV.

[35]  Luc Van Gool,et al.  Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds , 2020, ECCV.

[36]  Yan Wang,et al.  Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving , 2019, ICLR.

[37]  Stella X. Yu,et al.  BatVision: Learning to See 3D Spatial Layout with Two Ears , 2019, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[38]  Tanaya Guha,et al.  Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-shot Classification and Retrieval of Videos , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[39]  Sascha Hornauer,et al.  BatVision with GCC-PHAT Features for Better Sound to Vision Predictions , 2020, ArXiv.

[40]  Chuang Gan,et al.  Music Gesture for Visual Sound Separation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Yang Tang,et al.  Monocular depth estimation based on deep learning: An overview , 2020, Science China Technological Sciences.

[42]  Xin Li,et al.  Sparse-to-Dense Depth Completion Revisited: Sampling Strategy and Graph Construction , 2020, ECCV.

[43]  Weiyao Lin,et al.  Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching , 2020, NeurIPS.

[44]  Yi Li,et al.  Learning Representations from Audio-Visual Spatial Alignment , 2020, NeurIPS.

[45]  Siddharth Srivastava,et al.  Exploiting Local Geometry for Feature and Graph Construction for Better 3D Point Cloud Processing with Graph Neural Networks , 2021, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[46]  Vinay P. Namboodiri,et al.  AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[47]  Konrad Schindler,et al.  Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.