Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
[1] Wenwu Wang,et al. Audio–Visual Particle Flow SMC-PHD Filtering for Multi-Speaker Tracking, 2020, IEEE Transactions on Multimedia.
[2] Yuexian Zou,et al. Weakly Labelled Audio Tagging Via Convolutional Networks with Spatial and Channel-Wise Attention , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[3] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] Yan Yan,et al. Dual Attention Matching for Audio-Visual Event Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[5] Mark D. Plumbley,et al. Weakly Labelled AudioSet Tagging With Attention Neural Networks , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[6] Yu-Chiang Frank Wang,et al. Dual-modality Seq2Seq Network for Audio-visual Event Localization , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[7] Yongdong Zhang,et al. Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling , 2018, ACM Multimedia.
[8] Xiaojie Wang,et al. Object-Difference Attention: A Simple Relational Attention for Visual Question Answering , 2018, ACM Multimedia.
[9] In-So Kweon,et al. CBAM: Convolutional Block Attention Module , 2018, ECCV.
[10] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[11] Andrew Zisserman,et al. Learnable PINs: Cross-Modal Embeddings for Person Identity , 2018, ECCV.
[12] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[13] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[14] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[15] Tae-Hyun Oh,et al. Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[17] Yong Xu,et al. Audio Set Classification with Attention Model: A Probabilistic Perspective , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[18] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[19] Antonio Torralba,et al. See, Hear, and Read: Deep Aligned Representations , 2017, ArXiv.
[20] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[21] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[22] Nima Mesgarani,et al. Deep attractor network for single-microphone speaker separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[23] Tat-Seng Chua,et al. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Gregory D. Hager,et al. Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[26] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[27] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.
[28] Heikki Huttunen,et al. Recurrent neural networks for polyphonic sound event detection in real life recordings , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[29] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.
[30] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[31] Josef Kittler,et al. Audio Assisted Robust Visual Tracking With Adaptive Particle Filtering , 2015, IEEE Transactions on Multimedia.
[32] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[33] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[34] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[35] Jonathon A. Chambers,et al. Audiovisual Speech Source Separation: An overview of key methodologies , 2014, IEEE Signal Processing Magazine.
[36] Jeff A. Bilmes,et al. Deep Canonical Correlation Analysis , 2013, ICML.
[37] A. Mesaros,et al. Context-dependent sound event detection , 2013, EURASIP J. Audio Speech Music. Process.
[38] C. Koch,et al. Explicit Encoding of Multimodal Percepts by Single Neurons in the Human Brain , 2009, Current Biology.
[39] Taras Butko,et al. Audiovisual event detection towards scene understanding , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
[40] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008.
[41] T. Rogers,et al. Where do you know what you know? The representation of semantic knowledge in the human brain , 2007, Nature Reviews Neuroscience.
[42] Manuele Bicego,et al. Audio-Visual Event Recognition in Surveillance Video Sequences , 2007, IEEE Transactions on Multimedia.
[43] Yann LeCun,et al. Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).
[44] Javier R. Movellan,et al. Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.
[45] William W. Gaver. What in the World Do We Hear? An Ecological Approach to Auditory Event Perception , 1993.