Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., dressing) to the action (e.g., mix yogurt) that produced it. The key challenge is the inevitable visual-linguistic ambiguity that arises from changes in both the visual appearance and the referring expression of an entity over the course of the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, in which linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by training our model without supervision on more than two thousand unstructured cooking videos from YouTube, and show that our joint visual-linguistic model substantially improves upon a state-of-the-art linguistic-only model for reference resolution in instructional videos.
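To make the joint-cue idea concrete, the toy sketch below links an entity mention to the earlier action that best explains it by fusing a linguistic similarity with a visual similarity. This is a minimal illustration only: the feature vectors, the cosine scorer, and the weighted-sum fusion are assumptions for exposition, not the model described in the paper.

```python
# Illustrative sketch: score each earlier action as a candidate antecedent
# for an entity mention using both a linguistic and a visual cue.
# All feature vectors and the fusion weight `alpha` are hypothetical.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def resolve_reference(entity_text_vec, entity_vis_vec, candidates, alpha=0.5):
    """Return the index of the candidate action that best matches the entity.

    candidates: list of (action_text_vec, action_vis_vec) pairs, one per
    earlier action in the video; alpha trades off the linguistic cue
    against the visual cue.
    """
    scores = [
        alpha * cosine(entity_text_vec, t) + (1 - alpha) * cosine(entity_vis_vec, v)
        for t, v in candidates
    ]
    return int(np.argmax(scores)), scores

# Toy usage with random stand-in embeddings (in practice one might use
# word embeddings for the text and CNN features for the video segments).
rng = np.random.default_rng(0)
entity_txt, entity_vis = rng.normal(size=50), rng.normal(size=128)
cands = [(rng.normal(size=50), rng.normal(size=128)) for _ in range(3)]
best, s = resolve_reference(entity_txt, entity_vis, cands)
print(best, [round(x, 3) for x in s])
```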
