暂无分享,去创建一个
Jianfeng Gao | Asli Celikyilmaz | Shih-Fu Chang | Jianwei Yang | Hamid Palangi | Hassan Akbari | Paul Smolensky | Roland Fernandez | Sudha Rao | P. Smolensky | Jianfeng Gao | Shih-Fu Chang | Asli Celikyilmaz | H. Palangi | Jianwei Yang | Hassan Akbari | Sudha Rao | Roland Fernandez
[1] Tao Mei,et al. Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[5] Mohit Bansal,et al. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning , 2020, ACL.
[6] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Li Deng,et al. Grammatically-Interpretable Learned Representations in Deep NLP Models , 2018 .
[8] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[9] Li Deng,et al. Question-Answering with Grammatically-Interpretable Representations , 2017, AAAI.
[10] Xilin Chen,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[11] Michael P. Kaschak,et al. Grounding language in action , 2002, Psychonomic bulletin & review.
[12] Yoshua Bengio,et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , 2013, ArXiv.
[13] Ashish Vaswani,et al. Self-Attention with Relative Position Representations , 2018, NAACL.
[14] Jacob Andreas,et al. Experience Grounds Language , 2020, EMNLP.
[15] Rishabh Singh,et al. Learning and analyzing vector encoding of symbolic representations , 2018, ICLR.
[16] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.
[17] Jianfeng Gao,et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.
[18] Cordelia Schmid,et al. Learning Video Representations using Contrastive Bidirectional Transformer , 2019 .
[19] Jianfeng Gao,et al. Robust Conversational AI with Grounded Text Generation , 2020, ArXiv.
[20] Jianfeng Gao,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.
[21] Sheng Liu,et al. SibNet: Sibling Convolutional Encoder for Video Captioning , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[22] Mohit Bansal,et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.
[23] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[24] R'emi Louf,et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.
[25] Mike Schuster,et al. Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[26] Zhongfei Zhang,et al. TVT: Two-View Transformer Network for Video Captioning , 2018, ACML.
[27] Zheng-Jun Zha,et al. Learning to Discretely Compose Reasoning Module Networks for Video Captioning , 2020, IJCAI.
[28] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[29] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[30] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[31] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.
[32] Ewan Dunbar,et al. RNNs Implicitly Implement Tensor Product Representations , 2018, ICLR.
[33] Jianfeng Gao,et al. Mapping natural-language problems to formal-language solutions using structured neural representations , 2019, ICML.
[34] Yongdong Zhang,et al. Learning Multimodal Attention LSTM Networks for Video Captioning , 2017, ACM Multimedia.
[35] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[36] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[37] Ming Zhou,et al. Dense Procedure Captioning in Narrated Instructional Videos , 2019, ACL.
[38] Yu-Gang Jiang,et al. Motion Guided Spatial Attention for Video Captioning , 2019, AAAI.
[39] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[40] Xin Wang,et al. Video Captioning via Hierarchical Reinforcement Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[42] Jianfeng Gao,et al. Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving , 2019, ArXiv.
[43] Qingming Huang,et al. Less Is More: Picking Informative Frames for Video Captioning , 2018, ECCV.
[44] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .
[45] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[46] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[48] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[49] Jean Carletta,et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , 2005, ACL 2005.
[50] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Tao Mei,et al. Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning , 2019, AAAI.
[52] Jianfei Cai,et al. Auto-Encoding Scene Graphs for Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Geoffrey E. Hinton. Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems , 1991 .
[54] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[55] Trevor Darrell,et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.
[56] Jianfeng Gao,et al. HUBERT Untangles BERT to Improve Transfer across NLP Tasks , 2019, ArXiv.
[57] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Dan Klein,et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.
[59] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[60] Jiebo Luo,et al. Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[61] Cordelia Schmid,et al. VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[62] Bernt Schiele,et al. Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.
[63] Matthijs C. Dorst. Distinctive Image Features from Scale-Invariant Keypoints , 2011 .
[64] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.
[65] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[67] R. Thomas McCoy,et al. Discovering the Compositional Structure of Vector Representations with Role Learning Networks , 2019, BLACKBOXNLP.
[68] Kunio Fukunaga,et al. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.
[69] Basura Fernando,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.
[70] Yu-Wing Tai,et al. Memory-Attended Recurrent Network for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[71] Alexei Baevski,et al. vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.
[72] Jianfeng Gao,et al. Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators , 2019, ArXiv.
[73] Wei Liu,et al. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Chin-Yew Lin,et al. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics , 2004, ACL.
[75] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[76] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[77] Luowei Zhou,et al. End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[78] Ben Poole,et al. Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.
[79] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.