Seeing What You're Told: Sentence-Guided Activity Recognition in Video

We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as for multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and the changing spatial relations between participants (prepositions), expressed as whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different ways.
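To make the idea of a sentence guiding recognition concrete, the sketch below is a toy illustration (not the authors' system): each content word of a parsed sentence contributes a constraint over the tracks filling its grammatical roles, and sentence-guided focus of attention picks the candidate tracks that best satisfy all constraints jointly. The predicate definitions, the hard-coded two-word "parse", and the exhaustive search over role bindings are all simplifying assumptions introduced purely for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Track = List[Tuple[float, float]]        # per-frame (x, y) centroid of one candidate object

# Hypothetical word predicates: each maps the tracks filling its roles to a score.
def approached(agent: Track, patient: Track) -> float:
    start = abs(agent[0][0] - patient[0][0])
    end = abs(agent[-1][0] - patient[-1][0])
    return start - end                    # agent should end up closer to the patient

def leftward(agent: Track) -> float:
    return agent[0][0] - agent[-1][0]     # agent's x-coordinate should decrease

# Toy "parse" of a sentence: each content word names the roles it constrains.
SENTENCE: Sequence[Tuple[Callable[..., float], Tuple[str, ...]]] = [
    (approached, ("agent", "patient")),   # verb
    (leftward,   ("agent",)),             # adverbial modifier of the verb
]

def sentence_score(binding: Dict[str, Track]) -> float:
    """Sum the per-word scores under one assignment of tracks to roles."""
    return sum(pred(*(binding[r] for r in roles)) for pred, roles in SENTENCE)

def focus_of_attention(candidates: List[Track]) -> Dict[str, Track]:
    """Sentence-guided focus of attention: choose which candidate tracks fill
    the sentence's roles so that the whole description is best satisfied."""
    roles = sorted({r for _, rs in SENTENCE for r in rs})
    best, best_score = None, float("-inf")
    for combo in product(candidates, repeat=len(roles)):
        binding = dict(zip(roles, combo))
        s = sentence_score(binding)
        if s > best_score:
            best, best_score = binding, s
    return best

if __name__ == "__main__":
    person = [(10.0, 0.0), (7.0, 0.0), (4.0, 0.0)]   # moves leftward, toward x = 0
    chair = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]     # stationary
    chosen = focus_of_attention([person, chair])
    print({role: ("person" if t is person else "chair") for role, t in chosen.items()})
```

Run on these two toy tracks, the sentence "approached ... leftward" binds the moving track to the agent role and the stationary one to the patient role; the same scoring machinery, applied in reverse (searching over words instead of tracks) or over a video corpus, hints at how generation and query-based search reuse the framework.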
