Ambient Sound Provides Supervision for Visual Learning
暂无分享,去创建一个
Andrew Owens | Jiajun Wu | Antonio Torralba | William T. Freeman | Josh H. McDermott | A. Torralba | W. Freeman | Andrew Owens | Jiajun Wu
[1] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[2] Eero P. Simoncelli,et al. Article Sound Texture Perception via Statistics of the Auditory Periphery: Evidence from Sound Synthesis , 2022 .
[3] Daniel P. W. Ellis,et al. Classifying soundtracks with audio texture features , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] Ivan Laptev,et al. Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Malcolm Slaney,et al. FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks , 2000, NIPS.
[6] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[7] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.
[8] Bolei Zhou,et al. Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.
[9] Thomas Brox,et al. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.
[10] Edward H. Adelson,et al. Learning visual groups from co-occurrences in space and time , 2015, ArXiv.
[11] Jitendra Malik,et al. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons , 2001, International Journal of Computer Vision.
[12] Daniel P. W. Ellis,et al. Detecting local semantic concepts in environmental sounds using Markov model based clustering , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.
[13] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.
[14] Krista A. Ehinger,et al. SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[15] Antonio Torralba,et al. Spectral Hashing , 2008, NIPS.
[16] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.
[17] Hossein Mobahi,et al. Deep learning from temporal coherence in video , 2009, ICML '09.
[18] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[19] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[20] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[21] William W. Gaver. What in the World Do We Hear? An Ecological Approach to Auditory Event Perception , 1993 .
[22] Nitish Srivastava,et al. Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..
[23] Jiri Matas,et al. All you need is a good init , 2015, ICLR.
[24] Marc'Aurelio Ranzato,et al. Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[25] Vesa T. Peltonen,et al. Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.
[26] David A. Shamma,et al. The New Data and New Challenges in Multimedia Research , 2015, ArXiv.
[27] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[28] Trevor Darrell,et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.
[29] Michael Elad,et al. Pixels that sound , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[30] Trevor Darrell,et al. Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.
[31] Kristen Grauman,et al. Learning Image Representations Tied to Ego-Motion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[32] Nitish Srivastava. Unsupervised Learning of Visual Representations using Videos , 2015 .
[33] Geoffrey E. Hinton,et al. Semantic hashing , 2009, Int. J. Approx. Reason..
[34] Jonathan Tompson,et al. Unsupervised Feature Learning from Temporal Data , 2015, ICLR.
[35] Piotr Indyk,et al. Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.
[36] Javier R. Movellan,et al. Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.
[37] Andrew Owens,et al. Visually Indicated Sounds , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Thomas Brox,et al. Learning to generate chairs with convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Jitendra Malik,et al. Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[40] Bolei Zhou,et al. Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.