Paper Information - An Analysis of Action Recognition Datasets for Language and Vision Tasks

An Analysis of Action Recognition Datasets for Language and Vision Tasks

A large amount of recent research has focused on tasks that combine language and vision, resulting in a proliferation of datasets and methods. One such task is action recognition, whose applications include image annotation, scene understanding and image retrieval. In this survey, we categorize the existing approaches based on how they conceptualize this problem and provide a detailed review of existing datasets, highlighting their diversity as well as advantages and disadvantages. We focus on recently developed datasets which link visual information with linguistic resources and provide a fine-grained syntactic and semantic analysis of actions in images.

Frank Keller | Spandana Gella
