[1] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[2] Vinod Yegneswaran, et al. Automated Categorization of Onion Sites for Analyzing the Darkweb Ecosystem, 2017, KDD.
[3] Tianqi Liu, et al. BERT for Large-scale Video Segment Classification with Test-time Augmentation, 2019, ArXiv.
[4] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[5] Bruce A. Draper, et al. Gesture Recognition: Focus on the Hands, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[6] Qifeng Chen, et al. Fully Automatic Video Colorization With Self-Regularization and Diversity, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Nitish Srivastava. Unsupervised Learning of Visual Representations using Videos, 2015.
[8] Hossein Mobahi, et al. Deep learning from temporal coherence in video, 2009, ICML '09.
[9] Larry P. Heck, et al. Generative Visual Dialogue System via Adaptive Reasoning and Weighted Likelihood Estimation, 2019, ArXiv.
[10] Larry P. Heck, et al. Efficient Incremental Learning for Mobile Object Detection, 2019, ArXiv.
[11] Diyi Yang, et al. Hierarchical Attention Networks for Document Classification, 2016, NAACL.
[12] Trevor Darrell, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[13] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[14] Alexander C. Loui, et al. Audio-visual grouplet: temporal audio-visual interactions for general video concept classification, 2011, ACM Multimedia.
[15] Yingyu Liang, et al. Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis, 2019, AAAI.
[16] Benoit Huet, et al. Fusion of Multimodal Embeddings for Ad-Hoc Video Search, 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).
[17] Geoffrey E. Hinton, et al. Self-organizing neural network that discovers surfaces in random-dot stereograms, 1992, Nature.
[18] Silvio Savarese, et al. Making Sense of Vision and Touch: Learning Multimodal Representations for Contact-Rich Tasks, 2019, IEEE Transactions on Robotics.
[19] Basavaraj A. Goudannavar, et al. Correlation analysis of audio and video contents: A metadata based approach, 2015, 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT).
[20] Hongxia Jin, et al. Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[21] Justin Dauwels, et al. EduBrowser: A Multimodal Automated Monitoring System for Co-located Collaborative Learning, 2019, LTEC.
[22] Ian Davidson, et al. A Framework for Deep Constrained Clustering - Algorithms and Advances, 2019, ECML/PKDD.
[23] Juan Carlos Niebles, et al. What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[24] Henry A. Kautz, et al. Combining Subjective Probabilities and Data in Training Markov Logic Networks, 2012, ECML/PKDD.
[25] Xiao Liu, et al. Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[26] Apostol Natsev, et al. YouTube-8M: A Large-Scale Video Classification Benchmark, 2016, ArXiv.
[27] Jinfeng Yi, et al. AdvIT: Adversarial Frames Identifier Based on Temporal Consistency in Videos, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Vladlen Koltun, et al. Feature Space Optimization for Semantic Video Segmentation, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Rui Wang, et al. Virtual Reality Scene Construction Based on Multimodal Video Scene Segmentation Algorithm, 2019, 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC).
[30] Ivan Marsic, et al. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment, 2018, ACL.
[31] Marcel Worring, et al. Multimodal Video Indexing: A Review of the State-of-the-art, 2001.
[32] Yalin Wang, et al. Regularize, Expand and Compress: NonExpansive Continual Learning, 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[33] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[34] Lorenzo Torresani, et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization, 2018, NeurIPS.
[35] Gaël Richard, et al. On the Correlation of Automatic Audio and Visual Segmentations of Music Videos, 2007, IEEE Transactions on Circuits and Systems for Video Technology.
[36] Eric Granger, et al. Multimodal Fusion with Deep Neural Networks for Audio-Video Emotion Recognition, 2019, ArXiv.
[37] Tomás Pajdla, et al. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[38] Larry P. Heck, et al. Contextual LSTM (CLSTM) models for Large scale NLP tasks, 2016, ArXiv.
[39] John R. Hershey, et al. Attention-Based Multimodal Fusion for Video Description, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).