Unsupervised Semantic Parsing of Video Collections

Human communication typically has an underlying structure. This is reflected in the fact that many user-generated videos have an identifiable starting point, an ending, and a set of objective steps in between. In this paper, we propose a method for parsing a video into such semantic steps in an unsupervised way. The proposed method provides a semantic "storyline" of the video composed of its objective steps. We accomplish this by exploiting both visual and language cues in a joint generative model. The proposed method can also produce a textual description for each identified semantic step and video segment. We evaluate the method on a large set of complex YouTube videos and show results of unprecedented quality for this new and impactful problem.
