Generative Modeling of Audible Shapes for Object Perception

Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine with such competency, however, is very challenging because it is difficult to capture large-scale, clean data of objects with both their appearance and the sounds they make. In this paper, we present a novel, open-source pipeline that generates audio-visual data purely from 3D object shapes and their physical properties. Through comparison with audio recordings and human behavioral studies, we validate the accuracy of the sounds it generates. Using this generative model, we construct a synthetic audio-visual dataset, Sound-20K, for object perception tasks. We demonstrate that auditory and visual information play complementary roles in object perception and, further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.
