Shiping Wen | Aidong Men | Zhuqing Jiang | Jianning Wu | Haiying Wang
[1] Bohyung Han,et al. Weakly Supervised Action Localization by Sparse Temporal Pooling Network , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[2] Chenliang Xu,et al. Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing , 2020, ECCV.
[3] Yong Jae Lee,et al. Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[4] Huy Phan,et al. Improved Audio Scene Classification Based on Label-Tree Embeddings and Convolutional Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[5] Annamaria Mesaros,et al. Metrics for Polyphonic Sound Event Detection , 2016.
[6] Louis-Philippe Morency,et al. Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[7] Phillip Isola,et al. Contrastive Multiview Coding , 2019, ECCV.
[8] Joshua B. Tenenbaum,et al. Separating Style and Content with Bilinear Models , 2000, Neural Computation.
[9] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[10] Chenliang Xu,et al. Audio-Visual Event Localization in the Wild , 2019, CVPR Workshops.
[11] Amit K. Roy-Chowdhury,et al. W-TALC: Weakly-supervised Temporal Activity Localization and Classification , 2018, ECCV.
[12] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[13] Louis-Philippe Morency,et al. Deep multimodal fusion for persuasiveness prediction , 2016, ICMI.
[14] Léon Bottou,et al. Stochastic Gradient Descent Tricks , 2012, Neural Networks: Tricks of the Trade.
[15] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.
[16] Chenhui Chu,et al. BERT Representations for Video Question Answering , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[17] Tao Zhang,et al. Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector , 2018, ACM Multimedia.
[18] Justin Salamon,et al. Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[20] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[21] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Dahua Lin,et al. PolyNet: A Pursuit of Structural Diversity in Very Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[24] Yueming Lyu,et al. Marginalized Average Attentional Network for Weakly-Supervised Learning , 2019, ICLR.
[25] Cho-Jui Hsieh,et al. VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.
[26] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[27] Quoc V. Le,et al. Neural Architecture Search with Reinforcement Learning , 2016, ICLR.
[28] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[29] Yan Yan,et al. Dual Attention Matching for Audio-Visual Event Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019.
[32] Erik Cambria,et al. Tensor Fusion Network for Multimodal Sentiment Analysis , 2017, EMNLP.
[33] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[34] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[35] Yuexian Zou,et al. CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Yehao Li,et al. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network , 2021, AAAI.
[37] Luc Van Gool,et al. UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Yu-Chiang Frank Wang,et al. Dual-modality Seq2Seq Network for Audio-visual Event Localization , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[39] Yang Gao,et al. Compact Bilinear Pooling , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Furu Wei,et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.
[41] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[42] Shankar Kumar,et al. Neural Language Modeling with Visual Features , 2019, ArXiv.
[43] Qishuo Lu,et al. Split to Be Slim: An Overlooked Redundancy in Vanilla Convolution , 2020, IJCAI.
[44] Saurabh Singh,et al. Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Jianfeng Gao,et al. M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Jiasen Lu,et al. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers , 2020, EMNLP.