Video Description
Wei Liu | Syed Zulqarnain Gilani | Ajmal Mian | Nayyer Aafaq
[1] Marcus Rohrbach,et al. Multimodal Video Description , 2016, ACM Multimedia.
[2] David L. Chen,et al. Training a multilingual sportscaster , 2010 .
[3] Yejin Choi,et al. TreeTalk: Composition and Compression of Trees for Image Descriptions , 2014, TACL.
[4] Ehud Reiter,et al. Book Reviews: Building Natural Language Generation Systems , 2000, CL.
[5] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Cyrus Rashtchian,et al. Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.
[7] Lorenzo Torresani,et al. C3D: Generic Features for Video Analysis , 2014, ArXiv.
[8] David A. Forsyth,et al. Matching Words and Pictures , 2003, J. Mach. Learn. Res..
[9] Bernt Schiele,et al. A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[10] Mario Fritz,et al. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.
[11] Cordelia Schmid,et al. Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.
[12] Paul A. Viola,et al. Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.
[13] Matthew Brand,et al. The "Inverse Hollywood Problem": From Video to Scripts and Storyboards via Causal Analysis , 1997, AAAI/IAAI.
[14] Wei Liu,et al. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[15] Luowei Zhou,et al. End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] Matt J. Kusner,et al. From Word Embeddings To Document Distances , 2015, ICML.
[17] Bernt Schiele,et al. Coherent Multi-sentence Video Description with Variable Level of Detail , 2014, GCPR.
[18] Antonio Torralba,et al. Context-based vision system for place and object recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[19] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[21] Wei Liu,et al. Reconstruction Network for Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[22] Timothy Baldwin,et al. Can machine translation systems be evaluated by the crowd alone , 2015, Natural Language Engineering.
[23] Boqing Gong,et al. End-to-End Video Captioning With Multitask Reinforcement Learning , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
[24] Mun Wai Lee,et al. SAVE: A framework for semantic annotation of visual events , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.
[25] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[26] Peng Wang,et al. Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Robert J. Gaizauskas,et al. Cross-validating Image Description Datasets and Evaluation Metrics , 2016, LREC.
[28] Navdeep Jaitly,et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.
[29] Kate Saenko,et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge , 2013, AAAI.
[30] Jeffrey Mark Siskind,et al. Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information , 2015, AAAI.
[31] Jason J. Corso. GBS: Guidance by Semantics-Using High-Level Visual Inference to Improve Vision-Based Mobile Robot Localization , 2015 .
[32] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.
[33] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[34] Ruslan Salakhutdinov,et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.
[35] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[36] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Claudio S. Pinhanez,et al. Human action detection using PNF propagation of temporal constraints , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).
[38] Gregory D. Hager,et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions , 2009, CVPR.
[39] Aaron F. Bobick,et al. A State-Based Approach to the Representation and Recognition of Gesture , 1997, IEEE Trans. Pattern Anal. Mach. Intell..
[40] Xin Wang,et al. Video Captioning via Hierarchical Reinforcement Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[41] Yongdong Zhang,et al. Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Kunio Fukunaga,et al. Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions , 2002, International Journal of Computer Vision.
[43] Basura Fernando,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.
[44] Christopher Joseph Pal,et al. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.
[45] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[46] Yi Yang,et al. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Liang Lin,et al. Interpretable Video Captioning via Trajectory Structured Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[48] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Qingming Huang,et al. Less Is More: Picking Informative Frames for Video Captioning , 2018, ECCV.
[50] Fujio Nishida,et al. Japanese-English Translation Through Internal Expressions , 1982, COLING.
[51] Philipp Koehn,et al. Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.
[52] David G. Lowe,et al. Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.
[53] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.
[54] James R. Glass,et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.
[55] Bernt Schiele,et al. Grounding Action Descriptions in Videos , 2013, TACL.
[56] Tatsuya Harada,et al. Beyond caption to narrative: Video captioning with multiple sentences , 2016, 2016 IEEE International Conference on Image Processing (ICIP).
[57] Philipp Koehn,et al. Findings of the 2017 Conference on Machine Translation (WMT17) , 2017, WMT.
[58] Wei Liu,et al. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Yejin Choi,et al. Composing Simple Image Descriptions using Web-scale N-grams , 2011, CoNLL.
[60] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[61] Dan Klein,et al. Grounding spatial relations for human-robot interaction , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[62] Tamara L. Berg,et al. Baby Talk: Understanding and Generating Image Descriptions , 2011 .
[63] Muhammad Usman Ghani Khan,et al. Describing Video Contents in Natural Language , 2012 .
[64] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[65] Shaogang Gong,et al. Recognition of group activities using dynamic probabilistic networks , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.
[66] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Mert Kilickaya,et al. Re-evaluating Automatic Metrics for Image Captioning , 2016, EACL.
[68] Yoshua Bengio,et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.
[69] Tieniu Tan,et al. M3: Multimodal Memory Modelling for Video Captioning , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[70] Yaser Sheikh,et al. CASEE: A Hierarchical Event Representation for the Analysis of Videos , 2004, AAAI.
[71] Chenliang Xu,et al. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.
[72] Bernt Schiele,et al. Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.
[73] Lei Zhang,et al. Human Focused Video Description , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).
[74] Wei Chen,et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.
[75] Ramakant Nevatia,et al. Semantic Aware Video Transcription Using Random Forest Classifiers , 2014, ECCV.
[76] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[77] Xirong Li,et al. Early Embedding and Late Reranking for Video Captioning , 2016, ACM Multimedia.
[78] George Awad,et al. Evaluation of automatic video captioning using direct assessment , 2017, PloS one.
[79] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[80] Bernt Schiele,et al. Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.
[81] Florian Metze,et al. Beyond audio and video retrieval: towards multimedia summarization , 2012, ICMR.
[82] Frank Keller,et al. Comparing Automatic Evaluation Measures for Image Description , 2014, ACL.
[83] Stephen E. Robertson,et al. Understanding inverse document frequency: on theoretical arguments for IDF , 2004, J. Documentation.
[84] Christopher D. Manning,et al. Compositional Attention Networks for Machine Reasoning , 2018, ICLR.
[85] Jorma Laaksonen,et al. Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation , 2016, ACM Multimedia.
[86] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..
[87] Dieter Fox,et al. Following directions using statistical machine translation , 2010, 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI).
[88] Tao Mei,et al. Jointly Localizing and Describing Events for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[89] Carlo Tomasi,et al. Good features to track , 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[90] Kate Saenko,et al. Joint Event Detection and Description in Continuous Video Streams , 2018, 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).
[91] Irfan A. Essa,et al. Recognizing multitasked activities from video using stochastic context-free grammar , 2002, AAAI/IAAI.
[92] S. David,et al. Human Face Detection and Tracking using Skin Color Modeling and Connected Component Operators , 2002 .
[93] Juan Carlos Niebles,et al. Title Generation for User Generated Videos , 2016, ECCV.
[94] Kate Saenko,et al. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild , 2014, COLING.
[95] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[96] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.
[97] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.
[98] Ramakanth Pasunuru,et al. Reinforced Video Captioning with Entailment Rewards , 2017, EMNLP.
[99] Trevor Darrell,et al. Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.
[100] Tao Mei,et al. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed Action Recognition, Temporal Action Proposals and Dense-Captioning Events in Videos , 2017 .
[101] Shin'ichi Satoh,et al. Consensus-based Sequence Training for Video Captioning , 2017, ArXiv.
[102] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[103] Jongwook Choi,et al. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[104] Trevor Darrell,et al. Textual Explanations for Self-Driving Vehicles , 2018, ECCV.
[105] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[106] Zhou Su,et al. Weakly Supervised Dense Video Captioning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[107] David A. McAllester,et al. A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.
[108] Jia Chen,et al. Describing Videos using Multi-modal Fusion , 2016, ACM Multimedia.
[109] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[110] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[111] Tao Mei,et al. Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[112] Richard S. Zemel,et al. Exploring Models and Data for Image Question Answering , 2015, NIPS.
[114] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[115] Tadaaki Tani,et al. Feedback of correcting information in postediting to a machine translation system , 1988, COLING.
[116] Yee Whye Teh,et al. Names and faces in the news , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..
[117] Sven J. Dickinson,et al. Video In Sentences Out , 2012, UAI.
[118] Ivan A. Sag,et al. Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar , 1996, CL.
[119] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[120] Jongwook Choi,et al. Supervising Neural Attention Models for Video Captioning by Human Gaze Data , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[121] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[122] Ramakant Nevatia,et al. Bayesian framework for video surveillance application , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.
[123] Deb Roy,et al. Semiotic schemas: A framework for grounding language in action and perception , 2005, Artif. Intell..
[124] Ramakant Nevatia,et al. An Ontology for Video Event Representation , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.
[125] Matthew R. Walter,et al. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation , 2011, AAAI.
[126] Christopher Joseph Pal,et al. Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.
[127] Deb Roy,et al. Connecting language to the world , 2005, Artif. Intell..
[128] Ivan Laptev,et al. Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[129] Hans-Hellmut Nagel,et al. Algorithmic characterization of vehicle trajectories from image sequences by motion verbs , 1991, Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[130] Jason Weston,et al. Talk the Walk: Navigating New York City through Grounded Dialogue , 2018, ArXiv.
[131] Georges Quénot,et al. TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking , 2017, TRECVID.
[132] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.
[133] Christopher Joseph Pal,et al. YouTube Scale, Large Vocabulary Video Annotation , 2010, Video Search and Mining.
[134] David A. McAllester,et al. Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[135] Joshua B. Tenenbaum,et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation , 2016, NIPS.
[136] Zhe Gan,et al. Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[137] Jonathan G. Fiscus,et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking , 2016, TRECVID.
[138] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.
[139] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[140] Kate Saenko,et al. Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text , 2016, EMNLP.
[141] Marcus Rohrbach,et al. A Dataset for Telling the Stories of Social Media Videos , 2018, EMNLP.
[142] Tao Mei,et al. Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[143] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[144] Wonjun Kim,et al. A Novel Method for Efficient Indoor–Outdoor Image Classification , 2010, J. Signal Process. Syst..
[145] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[146] Tom Schaul,et al. FeUdal Networks for Hierarchical Reinforcement Learning , 2017, ICML.
[147] Bernard Ghanem,et al. ActivityNet Challenge 2017 Summary , 2017, ArXiv.
[148] Rita Cucchiara,et al. Hierarchical Boundary-Aware Neural Encoder for Video Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[149] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[150] William B. Dolan,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.
[151] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[152] Leonidas J. Guibas,et al. The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.
[153] Marcus Rohrbach,et al. A Multi-scale Multiple Instance Video Description Network , 2015, ArXiv.
[154] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[155] Klamer Schutte,et al. Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions , 2012, ECCV Workshops.
[156] Klamer Schutte,et al. Recognition of 48 Human Behaviors from Video , 2012 .
[157] Li Fei-Fei,et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[158] David A. McAllester,et al. Cascade object detection with deformable part models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[159] Bernt Schiele,et al. A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[160] Cordelia Schmid,et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.
[161] Jonathan Weese,et al. UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems , 2013, *SEMEVAL.
[162] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.
[163] Bernt Schiele,et al. The Long-Short Story of Movie Description , 2015, GCPR.
[164] Zhongchao Shi,et al. Boosting Video Description Generation by Explicitly Translating from Frame-Level Captions , 2016, ACM Multimedia.
[165] Licheng Yu,et al. Visual Madlibs: Fill in the Blank Description Generation and Question Answering , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[166] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[167] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[168] Ilias Maglogiannis,et al. Face detection and recognition of natural human emotion using Markov random fields , 2007, Personal and Ubiquitous Computing.
[169] Xinlei Chen,et al. Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[170] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[171] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[172] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[174] Xu Wei,et al. Learning Like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[175] Ning Zhang,et al. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[176] Cordelia Schmid,et al. Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.
[177] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[178] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.