Shape and Material from Sound

From the sound of an object falling to the ground, humans can recover rich information about it, including its rough shape, material, and falling height. In this paper, we build machines that approximate this competence. We first mimic human knowledge of the physical world by building an efficient, physics-based simulation engine. We then present an analysis-by-synthesis approach to infer the properties of the falling object. We further accelerate inference by learning a mapping from a sound wave to object properties and using the predicted values to initialize it. This mapping can be viewed as an approximation of human commonsense acquired from past experience. Our model performs well on both synthetic audio clips and real recordings without requiring any annotated data. We conduct behavioral studies comparing human responses with our model's on estimating object shape, material, and falling height from sound, and find that the model achieves near-human performance.
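The pipeline described above can be sketched as a simple loop: a fast learned mapping proposes initial object properties, and analysis-by-synthesis then refines them by repeatedly simulating a sound for candidate properties and keeping candidates whose simulated sound better matches the observation. The sketch below is purely illustrative: `simulate_sound`, `feature_distance`, and `learned_initializer` are hypothetical stand-ins for the paper's physics-based audio engine, spectrogram comparison, and learned sound-to-properties network, and the random-perturbation search stands in for the actual inference procedure.

```python
import random

def simulate_sound(shape, material, height):
    # Stand-in for the physics-based audio engine: maps object properties
    # (here, two discrete IDs and a continuous height) to audio features.
    return [shape * 0.7 + material * 1.3, height * 2.0 + material * 0.5]

def feature_distance(a, b):
    # Distance between audio feature vectors; the paper compares richer
    # spectrogram features, plain squared error stands in here.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def learned_initializer(observed):
    # Stand-in for the fast learned mapping from sound to properties,
    # used only to seed the slower simulation-based inference.
    return (1, 1, 1.0)

def infer_properties(observed, iters=2000, seed=0):
    """Analysis-by-synthesis: perturb the current best candidate and keep
    it whenever its simulated sound better matches the observation."""
    rng = random.Random(seed)
    best = learned_initializer(observed)
    best_cost = feature_distance(simulate_sound(*best), observed)
    for _ in range(iters):
        shape, material, height = best
        cand = (max(0, shape + rng.choice([-1, 0, 1])),
                max(0, material + rng.choice([-1, 0, 1])),
                max(0.1, height + rng.uniform(-0.2, 0.2)))
        cost = feature_distance(simulate_sound(*cand), observed)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

if __name__ == "__main__":
    observed = simulate_sound(2, 3, 1.5)   # "recording" of an unseen object
    est, cost = infer_properties(observed)
    print(est, cost)
```

The design point the sketch illustrates is the division of labor: the learned mapping is cheap but approximate, while the simulation loop is accurate but slow, so seeding the latter with the former cuts the number of simulator calls needed to converge.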
