The Boombox: Visual Reconstruction from Acoustic Vibrations

We introduce The Boombox, a container that uses acoustic vibrations to reconstruct an image of its contents. When an object interacts with the container, it produces small acoustic vibrations whose exact characteristics depend on the physical properties of both the box and the object. We demonstrate how to use this incidental signal to predict visual structure. After learning, our approach remains effective even when a camera cannot see inside the box. Although we detect the vibrations with low-cost, low-power contact microphones, our results show that learning from multi-modal data can transform these cheap acoustic sensors into rich visual sensors. Given the ubiquity of containers, we believe integrating perception capabilities into them will enable new applications in human-computer interaction and robotics.
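
To make the learned audio-to-visual mapping concrete, below is a minimal sketch of one plausible architecture: a convolutional encoder-decoder that consumes log-spectrograms from several contact microphones and emits an image of the box contents. All names, layer sizes, microphone counts, and spectrogram shapes here are illustrative assumptions, not the paper's actual pipeline.

```python
import torch
import torch.nn as nn

class AudioToImage(nn.Module):
    """Illustrative encoder-decoder: multi-microphone log-spectrograms in,
    an RGB image of the container's contents out. Layer sizes and the
    number of microphones (n_mics) are assumptions for this sketch."""

    def __init__(self, n_mics: int = 4):
        super().__init__()
        # Encoder: treat each microphone's spectrogram as an input channel,
        # so cross-microphone timing/intensity cues are fused by the convs.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_mics, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample the fused acoustic features back to image space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, spectrograms: torch.Tensor) -> torch.Tensor:
        # spectrograms: (batch, n_mics, freq_bins, time_steps)
        return self.decoder(self.encoder(spectrograms))

# Example: four microphones, 64x64 log-spectrograms -> 64x64 RGB predictions.
model = AudioToImage(n_mics=4)
pred = model(torch.randn(8, 4, 64, 64))  # shape: (8, 3, 64, 64)
```

Under the multi-modal setup the abstract describes, such a model would be trained on pairs of recorded vibrations and camera images of the open box, minimizing a reconstruction (and possibly adversarial) loss; at test time the camera is no longer needed, since the acoustic signal alone drives the prediction.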
