Coherent Multi-sentence Video Description with Variable Level of Detail

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating single sentences and cannot vary the level of detail of their descriptions. In this paper, we address both limitations: we produce coherent multi-sentence descriptions of complex videos at a variable level of detail. To understand how detailed and short descriptions differ, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach: we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.
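
To make the two-step approach concrete, below is a minimal Python sketch. Everything in it is an assumption for illustration: the class SR, its fields (activity, obj, topic), the majority-vote topic filter, and the template realizer are hypothetical stand-ins, not the paper's implementation, which learns both the SR prediction and the language generation from data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class SR:
    """Per-segment semantic representation; the fields are illustrative."""
    activity: str  # e.g. "cut"
    obj: str       # e.g. "carrot"
    topic: str     # video-level topic, e.g. "salad"

def enforce_consistent_topic(srs: List[SR]) -> List[SR]:
    """Relabel every segment with the majority topic across the video -- a
    deliberately simple stand-in for the paper's across-sentence consistency,
    which is enforced at the level of the SR."""
    if not srs:
        return srs
    majority = Counter(sr.topic for sr in srs).most_common(1)[0][0]
    return [SR(sr.activity, sr.obj, majority) for sr in srs]

def realize(sr: SR) -> str:
    """Toy template realizer; the actual system learns the generation step
    rather than filling a fixed template."""
    return f"Preparing the {sr.topic}, the person {sr.activity}s the {sr.obj}."

def describe_video(segments: Iterable, predict_sr: Callable[..., SR]) -> str:
    """Step 1: predict an SR per segment. Step 2: generate one sentence per SR."""
    srs = enforce_consistent_topic([predict_sr(s) for s in segments])
    return " ".join(realize(sr) for sr in srs)

# Toy usage: a hard-coded predictor stands in for the learned visual model.
# The mispredicted "soup" topic in the last segment is corrected by the vote.
segments = [("peel", "carrot", "salad"),
            ("cut", "carrot", "salad"),
            ("stir", "dressing", "soup")]
print(describe_video(segments, lambda seg: SR(*seg)))
```

The point of the sketch is the structure: errors in per-segment topic predictions (the stray "soup" above) are smoothed out at the SR level before any text is generated, which is what keeps the multi-sentence output on a single consistent topic.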
