Learning visual models from paired audio-visual examples
[1] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[2] J. Kagan,et al. The Developmental Progression of Manipulative Play in the First Two Years. , 1976 .
[3] Harriet J. Nock,et al. Assessing face and speech consistency for monologue detection in video , 2002, MULTIMEDIA '02.
[4] Eric Krotkov,et al. Robotic Perception of Material , 1995, IJCAI.
[5] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.
[6] Hossein Mobahi,et al. Deep learning from temporal coherence in video , 2009, ICML '09.
[7] Michael I. Jordan,et al. A Probabilistic Interpretation of Canonical Correlation Analysis , 2005 .
[8] David Guth,et al. Echolocation Reconsidered: Using Spatial Variations in the Ambient Sound Field to Guide Locomotion , 1998 .
[9] Daniel P. W. Ellis,et al. Detecting local semantic concepts in environmental sounds using Markov model based clustering , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.
[10] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Jiri Matas,et al. All you need is a good init , 2015, ICLR.
[12] Antonio Torralba,et al. Spectral Hashing , 2008, NIPS.
[13] Philip H. S. Torr,et al. Joint Object-Material Category Segmentation from Audio-Visual Cues , 2016, BMVC.
[14] Heiga Zen,et al. Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends , 2015, IEEE Signal Processing Magazine.
[15] Anne Marie Tharpe,et al. Visual attention and hearing loss: past and current perspectives. , 2008, Journal of the American Academy of Audiology.
[16] Andrew Owens,et al. Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.
[17] Soumith Chintala,et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.
[18] D. Norman,et al. Everyday listening and auditory icons , 1988 .
[19] M. Mendelson,et al. The relation between audition and vision in the human newborn. , 1976, Monographs of the Society for Research in Child Development.
[20] Trevor Darrell,et al. Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.
[21] Thomas Brox,et al. Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.
[22] Matthew W. G. Dye,et al. Is Visual Selective Attention in Deaf Individuals Enhanced or Deficient? The Case of the Useful Field of View , 2009, PloS one.
[23] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[24] R. Baillargeon. The Acquisition of Physical Knowledge in Infancy: A Summary in Eight Lessons , 2007 .
[25] Alexei A. Efros,et al. Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[26] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[27] Lawrence D. Jackel,et al. Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.
[28] Terrence J. Sejnowski,et al. The “independent components” of natural scenes are edge filters , 1997, Vision Research.
[29] Piotr Indyk,et al. Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.
[30] Marc'Aurelio Ranzato,et al. Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[31] Vesa T. Peltonen,et al. Audio-based context recognition , 2006, IEEE Transactions on Audio, Speech, and Language Processing.
[32] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res.
[33] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[34] N. Kanwisher,et al. Spatial pattern of BOLD fMRI activation reveals cross-modal information in auditory cortex. , 2012, Journal of neurophysiology.
[35] Bolei Zhou,et al. Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.
[36] Antonio Torralba,et al. Anticipating the future by watching unlabeled video , 2015, ArXiv.
[37] Ivan Laptev,et al. Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Wojciech Zaremba,et al. Learning to Execute , 2014, ArXiv.
[39] Antonio Torralba,et al. Building the gist of a scene: the role of global image features in recognition. , 2006, Progress in brain research.
[40] Michael Gasser,et al. The Development of Embodied Cognition: Six Lessons from Babies , 2005, Artificial Life.
[41] Abhinav Gupta,et al. Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[42] Trevor Darrell,et al. Data-dependent Initializations of Convolutional Neural Networks , 2015, ICLR.
[43] Jitendra Malik,et al. Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[44] Claudio Perez Tamargo. Can one hear the shape of a drum , 2008 .
[45] Honglak Lee,et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.
[46] John Shawe-Taylor,et al. Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.
[47] Yair Weiss,et al. From learning models of natural image patches to whole image restoration , 2011, 2011 International Conference on Computer Vision.
[48] Jitendra Malik,et al. Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons , 2001, International Journal of Computer Vision.
[49] Joshua B. Tenenbaum,et al. Black boxes: Hypothesis testing via indirect perceptual evidence , 2014, CogSci.
[50] David J. Field,et al. Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.
[51] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.
[52] Javier R. Movellan,et al. Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.
[53] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.
[54] Carl Doersch,et al. Supervision Beyond Manual Annotations for Learning Visual Representations , 2016 .
[55] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[56] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[57] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.
[58] Hans-Jochen Heinze,et al. Sound increases the saliency of visual events , 2008, Brain Research.
[59] Yi Hu,et al. Speech enhancement based on wavelet thresholding the multitaper spectrum , 2004, IEEE Transactions on Speech and Audio Processing.
[60] Edward H. Adelson,et al. On seeing stuff: the perception of materials by humans and machines , 2001, IS&T/SPIE Electronic Imaging.
[61] Daniel P. W. Ellis,et al. Classifying soundtracks with audio texture features , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).