Generating Descriptions with Grounded and Co-referenced People
暂无分享,去创建一个
Seong Joon Oh | Bernt Schiele | Marcus Rohrbach | Anna Rohrbach | Siyu Tang | B. Schiele | Marcus Rohrbach | Anna Rohrbach | Siyu Tang
[1] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[2] Licheng Yu,et al. Modeling Context in Referring Expressions , 2016, ECCV.
[3] Kate Saenko,et al. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild , 2014, COLING.
[4] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.
[5] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Andrew Zisserman,et al. Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.
[7] Neil Genzlinger. A. and Q , 2006 .
[8] M. R. Rao,et al. The partition problem , 1993, Math. Program..
[9] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.
[10] Xiaogang Wang,et al. DeepReID: Deep Filter Pairing Neural Network for Person Re-identification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[11] Trevor Darrell,et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.
[12] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Tatsuya Harada,et al. Beyond caption to narrative: Video captioning with multiple sentences , 2016, 2016 IEEE International Conference on Image Processing (ICIP).
[14] Rainer Stiefelhagen,et al. “Knock! Knock! Who is it?” probabilistic person identification in TV-series , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[15] Jitendra Malik,et al. Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[16] Bernt Schiele,et al. Subgraph decomposition for multi-target tracking , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] B. Taskar,et al. Learning from ambiguously labeled images , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[18] Sanja Fidler,et al. Visual Semantic Search: Retrieving Videos via Complex Textual Queries , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[19] Ming Yang,et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[20] Bernt Schiele,et al. Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.
[21] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[23] Jiebo Luo,et al. Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Chenxi Liu,et al. Attention Correctness in Neural Image Captioning , 2016, AAAI.
[25] Andrew Zisserman,et al. Deep Face Recognition , 2015, BMVC.
[26] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[27] Cordelia Schmid,et al. Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.
[28] Trevor Darrell,et al. LSDA: Large Scale Detection through Adaptation , 2014, NIPS.
[29] Takeo Kanade,et al. An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.
[30] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Xiaogang Wang,et al. Deeply learned face representations are sparse, selective, and robust , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Qi Yin,et al. Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? , 2015, ArXiv.
[33] Bernt Schiele,et al. A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Bernt Schiele,et al. Fine-Grained Activity Recognition with Holistic and Pose Based Features , 2014, GCPR.
[35] Jeffrey Mark Siskind,et al. Grounded Language Learning from Video Described with Sentences , 2013, ACL.
[36] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[37] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.
[38] Francis Ferraro,et al. Visual Storytelling , 2016, NAACL.
[39] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Yoshiko Wakabayashi,et al. A cutting plane algorithm for a clustering problem , 1989, Math. Program..
[41] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[42] Christopher Joseph Pal,et al. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.
[43] Christopher D. Manning,et al. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.
[44] Christopher Joseph Pal,et al. Movie Description , 2016, International Journal of Computer Vision.
[45] Bernt Schiele,et al. The Long-Short Story of Movie Description , 2015, GCPR.
[46] Bolei Zhou,et al. Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.
[47] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Sanja Fidler,et al. What Are You Talking About? Text-to-Image Coreference , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[49] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Wojciech Zaremba,et al. Learning to Execute , 2014, ArXiv.
[51] Andrew Zisserman,et al. “Who are you?” - Learning person specific classifiers from video , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[52] Bernt Schiele,et al. Coherent Multi-sentence Video Description with Variable Level of Detail , 2014, GCPR.
[53] Yin Li,et al. Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Dekang Lin,et al. Bootstrapping Path-Based Pronoun Resolution , 2006, ACL.
[55] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[56] O. Parkhi. It ’ s in the bag : Stronger supervision for automated face labelling , 2015 .
[57] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[58] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.
[59] Wei Xu,et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.
[60] Sanja Fidler,et al. Generating Multi-sentence Natural Language Descriptions of Indoor Scenes , 2015, BMVC.
[61] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[62] Fei-Fei Li,et al. Linking People in Videos with "Their" Names Using Coreference Resolution , 2014, ECCV.
[63] Cristian Sminchisescu,et al. Spatio-Temporal Attention Models for Grounded Video Captioning , 2016, ACCV.
[64] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[65] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).