论文信息 - Ambient Sound Provides Supervision for Visual Learning

Ambient Sound Provides Supervision for Visual Learning

The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

[1] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Eero P. Simoncelli,et al. Article Sound Texture Perception via Statistics of the Auditory Periphery: Evidence from Sound Synthesis , 2022 .

[3] Daniel P. W. Ellis,et al. Classifying soundtracks with audio texture features , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Ivan Laptev,et al. Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Malcolm Slaney,et al. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks , 2000, NIPS.

[6] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[7] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[8] Bolei Zhou,et al. Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[9] Thomas Brox,et al. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.

[10] Edward H. Adelson,et al. Learning visual groups from co-occurrences in space and time , 2015, ArXiv.

[11] Jitendra Malik,et al. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons , 2001, International Journal of Computer Vision.

[12] Daniel P. W. Ellis,et al. Detecting local semantic concepts in environmental sounds using Markov model based clustering , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[14] Krista A. Ehinger,et al. SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15] Antonio Torralba,et al. Spectral Hashing , 2008, NIPS.

[16] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.

[17] Hossein Mobahi,et al. Deep learning from temporal coherence in video , 2009, ICML '09.

[18] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[20] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[21] William W. Gaver. What in the World Do We Hear? An Ecological Approach to Auditory Event Perception , 1993 .

[22] Nitish Srivastava,et al. Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[23] Jiri Matas,et al. All you need is a good init , 2015, ICLR.

[24] Marc'Aurelio Ranzato,et al. Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25] Vesa T. Peltonen,et al. Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[26] David A. Shamma,et al. The New Data and New Challenges in Multimedia Research , 2015, ArXiv.

[27] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.

[28] Trevor Darrell,et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.

[29] Michael Elad,et al. Pixels that sound , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[30] Trevor Darrell,et al. Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.

[31] Kristen Grauman,et al. Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32] Nitish Srivastava. Unsupervised Learning of Visual Representations using Videos , 2015 .

[33] Geoffrey E. Hinton,et al. Semantic hashing , 2009, Int. J. Approx. Reason..

[34] Jonathan Tompson,et al. Unsupervised Feature Learning from Temporal Data , 2015, ICLR.

[35] Piotr Indyk,et al. Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[36] Javier R. Movellan,et al. Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.

[37] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Thomas Brox,et al. Learning to generate chairs with convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Jitendra Malik,et al. Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40] Bolei Zhou,et al. Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.