The visual microphone

When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects (a glass of water, a potted plant, a box of tissues, or a bag of chips) into visual microphones. We recover sounds from high-speed footage of a variety of objects with different properties, and use both real and simulated data to examine some of the factors that affect our ability to visually recover sound. We evaluate the quality of recovered sounds using intelligibility and SNR metrics and provide input and recovered audio samples for direct comparison. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object.
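As a rough illustration of the SNR metric mentioned above, the following sketch computes a signal-to-noise ratio in decibels between a reference signal and a recovered one. It is a minimal example under stated assumptions, not the evaluation procedure used in the paper: the function name `snr_db` is hypothetical, both signals are assumed to be time-aligned, equal in length, and on the same amplitude scale, and the noise is taken to be the sample-wise difference between them.

```python
import math


def snr_db(reference, recovered):
    """Return the SNR in dB, treating (recovered - reference) as noise.

    Assumption: both sequences are already time-aligned and share the
    same amplitude scale; real recovered audio would need alignment
    and gain normalization first.
    """
    if len(reference) != len(recovered):
        raise ValueError("signals must have the same length")
    # Energy of the clean reference signal.
    signal_power = sum(r * r for r in reference)
    # Energy of the residual between recovered and reference samples.
    noise_power = sum((x - r) ** 2 for r, x in zip(reference, recovered))
    if noise_power == 0:
        return math.inf  # identical signals: no measurable noise
    return 10.0 * math.log10(signal_power / noise_power)


# Toy usage: a reference signal and a copy with a small constant error.
reference = [1.0, -1.0, 1.0, -1.0]
recovered = [1.1, -0.9, 1.1, -0.9]
print(round(snr_db(reference, recovered), 2))
```

A higher value means the recovered signal is closer to the reference. SNR alone does not capture how understandable recovered speech is, which is why the abstract pairs it with intelligibility metrics.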
