Visually Indicated Sounds

Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.
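To make the two-stage pipeline in the abstract concrete, here is a minimal sketch, not the authors' implementation: a per-frame CNN encoder feeds an LSTM that regresses a sound-feature vector per video frame, and an example-based step turns predicted features into audio by copying the nearest training waveform snippet. The layer sizes, `feat_dim`, and the function names `SoundFeaturePredictor` and `retrieve_waveform` are illustrative assumptions, not names from the paper.

```python
# Minimal sketch of the video-to-sound pipeline described above (assumed
# architecture; the paper's actual CNN, feature dimensions, and training
# details differ).
import torch
import torch.nn as nn

class SoundFeaturePredictor(nn.Module):
    def __init__(self, feat_dim=42, hidden=256):
        super().__init__()
        # Small per-frame image encoder (stand-in for a pretrained CNN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Recurrent network maps the frame-feature sequence to a
        # sound-feature vector at each time step.
        self.rnn = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        x = self.encoder(frames.flatten(0, 1))      # (B*T, 32)
        h, _ = self.rnn(x.view(B, T, -1))           # (B, T, hidden)
        return self.head(h)                         # (B, T, feat_dim)

def retrieve_waveform(pred_feats, train_feats, train_waves):
    """Example-based synthesis: for each predicted feature vector, copy the
    waveform snippet whose training features are closest in L2 distance.

    pred_feats: (T, D) predicted features for one video.
    train_feats: (N, D) features of N training snippets.
    train_waves: list of N 1-D waveform tensors.
    """
    dists = torch.cdist(pred_feats, train_feats)    # (T, N)
    nearest = dists.argmin(dim=1)
    return torch.cat([train_waves[i] for i in nearest])
```

Retrieving real recorded snippets, rather than inverting the predicted features directly into a waveform, is one plausible reading of the "example-based synthesis procedure": copied audio is guaranteed to sound natural even when the predicted features are noisy.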
