Video In Sentences Out

We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases,spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

[1]  Andrew J. Viterbi,et al.  Convolutional Codes and Their Performance in Communication Systems , 1971 .

[2]  K. X. M. Tzeng,et al.  Convolutional Codes and 'Their Performance in Communication Systems , 1971 .

[3]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[4]  C. Tomasi Detection and Tracking of Point Features , 1991 .

[5]  M. R. Manzini Learnability and Cognition , 1991 .

[6]  Carlo Tomasi,et al.  Good features to track , 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[7]  P. Kevitt Integration of Natural Language and Vision Processing , 1995, Springer Netherlands.

[8]  Larry S. Davis,et al.  Towards 3-D model-based tracking and recognition of human movement: a multi-view approach , 1995 .

[9]  Jeffrey Mark Siskind,et al.  A Maximum-Likelihood Approach to Visual Event Classification , 1996, ECCV.

[10]  Christoph Bregler,et al.  Learning and recognizing human dynamics in video sequences , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Alex Pentland,et al.  Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Shiqiang Yang,et al.  Motion based event recognition using HMM , 2002, Object recognition supported by user interaction for service robots.

[13]  B. Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[14]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[15]  Gu Xu,et al.  An HMM-based framework for video semantic analysis , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  Siobhan Chapman Logic and Conversation , 2005 .

[17]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[18]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[19]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2008, International Journal of Computer Vision.

[20]  Barbara Caputo,et al.  Local velocity-adapted motion events for spatio-temporal recognition , 2007, Comput. Vis. Image Underst..

[21]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[22]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Stefan Roth,et al.  People-tracking-by-detection and people-detection-by-tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[26]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Liang Lin,et al.  I2T: Image Parsing to Text Description , 2010, Proceedings of the IEEE.

[31]  David A. McAllester,et al.  Cascade object detection with deformable part models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Nazli Ikizler-Cinbis,et al.  Object, Scene and Actions: Combining Multiple Features for Human Action Recognition , 2010, ECCV.

[33]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[34]  A. Wierzbicka,et al.  Semantics and cognition. , 2006, Wiley interdisciplinary reviews. Cognitive science.

[35]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.