Grounding Action Descriptions in Videos

Recent work has shown that integrating visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general-purpose corpus that aligns high-quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.
