Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
暂无分享,去创建一个
[1] Dima Damen,et al. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[2] Xuelong Li,et al. Deep Multimodal Clustering for Unsupervised Audiovisual Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Tuomas Virtanen,et al. Context-dependent sound event detection , 2013, EURASIP Journal on Audio, Speech, and Music Processing.
[4] P. Bertelson,et al. Recalibration of temporal order perception by exposure to audio-visual asynchrony. , 2004, Brain research. Cognitive brain research.
[5] Xiaogang Wang,et al. Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation , 2020, ECCV.
[6] Rogério Schmidt Feris,et al. Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.
[7] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[8] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[9] Ming Yang,et al. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation , 2018, ECCV.
[10] Ankit Shah,et al. DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System , 2017, DCASE.
[11] Xiaogang Wang,et al. Vision-Infused Deep Audio Inpainting , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[12] Chenliang Xu,et al. An Attempt towards Interpretable Audio-Visual Video Captioning , 2018, ArXiv.
[13] Justin Salamon,et al. Adaptive Pooling Operators for Weakly Labeled Sound Event Detection , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[14] Tae-Hyun Oh,et al. Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[15] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[16] Kristen Grauman,et al. Co-Separating Sounds of Visual Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[17] P. Herrera,et al. RECURRENCE QUANTIFICATION ANALYSIS FEATURES FOR AUDITORY SCENE CLASSIFICATION , 2013 .
[18] Chenliang Xu,et al. Can multisensory training aid visual learning? A computational investigation. , 2019, Journal of vision.
[19] Lorenzo Torresani,et al. SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[20] Navdeep Jaitly,et al. Towards Better Decoding and Language Model Integration in Sequence to Sequence Models , 2016, INTERSPEECH.
[21] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Leonid Sigal,et al. Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Heikki Huttunen,et al. Recurrent neural networks for polyphonic sound event detection in real life recordings , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[25] Kevin Wilson,et al. Looking to listen at the cocktail party , 2018, ACM Trans. Graph..
[26] Chuang Gan,et al. Music Gesture for Visual Sound Separation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Antonio Torralba,et al. Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Aaron R. Seitz,et al. Benefits of multisensory learning , 2008, Trends in Cognitive Sciences.
[29] Tao Mei,et al. Gaussian Temporal Awareness Networks for Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Milind R. Naphade,et al. Discovering recurrent events in video using unsupervised methods , 2002, Proceedings. International Conference on Image Processing.
[31] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Bhiksha Raj,et al. Experiments on the DCASE Challenge 2016: Acoustic Scene Classification and Sound Event Detection in Real Life Recording , 2016, DCASE.
[33] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[34] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[35] Shih-Fu Chang,et al. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Dumitru Erhan,et al. Training Deep Neural Networks on Noisy Labels with Bootstrapping , 2014, ICLR.
[37] Cordelia Schmid,et al. Temporal Localization of Actions with Actoms. , 2013, IEEE transactions on pattern analysis and machine intelligence.
[38] C. Spence,et al. Multisensory Integration: Maintaining the Perception of Synchrony , 2003, Current Biology.
[39] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[40] Lorenzo Torresani,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[41] Chenliang Xu,et al. Audio-Visual Event Localization in the Wild , 2019, CVPR Workshops.
[42] Amit K. Roy-Chowdhury,et al. W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.
[43] Bolei Zhou,et al. Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[45] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[46] Shih-Fu Chang,et al. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Qi Zhang,et al. The thermal signature of a submerged jet impacting normal to a free surface , 2016, J. Vis..
[48] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Yan Yan,et al. Dual Attention Matching for Audio-Visual Event Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[50] Yong Jae Lee,et al. Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[51] Bohyung Han,et al. Weakly Supervised Action Localization by Sparse Temporal Pooling Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52] Tae-Hyun Oh,et al. Listen to Look: Action Recognition by Previewing Audio , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Xin Wang,et al. Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning , 2018, NAACL.
[54] Mubarak Shah,et al. Video Scene Understanding Using Multi-scale Analysis , 2009, 2009 IEEE 12th International Conference on Computer Vision.
[55] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[56] Justin Salamon,et al. A Dataset and Taxonomy for Urban Sound Research , 2014, ACM Multimedia.
[57] Daochang Liu,et al. Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[58] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.
[59] Annamaria Mesaros,et al. Metrics for Polyphonic Sound Event Detection , 2016 .
[60] Luc Van Gool,et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[61] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[62] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[63] Yi-Hsuan Yang,et al. Learning to Recognize Transient Sound Events using Attentional Supervision , 2018, IJCAI.
[64] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[65] B. Holden. Listen and learn , 2002 .
[66] Yapeng Tian,et al. Audio-Visual Interpretable and Controllable Video Captioning , 2019, CVPR Workshops.
[67] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[68] Yong Xu,et al. Audio Set Classification with Attention Model: A Probabilistic Perspective , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[69] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[70] Tomás Lozano-Pérez,et al. A Framework for Multiple-Instance Learning , 1997, NIPS.
[71] Yong Jae Lee,et al. Weakly-supervised Discovery of Visual Pattern Configurations , 2014, NIPS.
[72] Haroon Idrees,et al. The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..
[73] Andrew Owens,et al. Ambient Sound Provides Supervision for Visual Learning , 2016, ECCV.
[74] Nicolas Turpault,et al. Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments , 2018, DCASE.
[75] Florian Metze,et al. A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[76] David A. Bulkin,et al. Seeing sounds: visual and auditory interactions in the brain , 2006, Current Opinion in Neurobiology.
[77] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[78] Kristen Grauman,et al. 2.5D Visual Sound , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[79] Yu-Chiang Frank Wang,et al. Dual-modality Seq2Seq Network for Audio-visual Event Localization , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[80] Chuang Gan,et al. The Sound of Motions , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).