Learning visual models from paired audio-visual examples

From the clink of a mug placed onto a saucer to the bustle of a busy café, our days are filled with visual experiences that are accompanied by distinctive sounds. In this thesis, we show that these sounds can provide a rich training signal for learning visual models. First, we propose the task of predicting the sound that an object makes when struck as a way of studying physical interactions within a visual scene. We demonstrate this idea by training an algorithm to produce plausible soundtracks for videos in which people hit and scratch objects with a drumstick. Then, with human studies and automated evaluations on recognition tasks, we verify that the sounds produced by the algorithm convey information about actions and material properties. Second, we show that ambient audio (e.g., crashing waves, people speaking in a crowd) can also be used to learn visual models. We train a convolutional neural network to predict a statistical summary of the sounds that occur within a scene, and we demonstrate that the visual representation learned by the model conveys information about objects and scenes.

Thesis Supervisor: William Freeman
Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science
Thesis Readers: Antonio Torralba, Josh McDermott
