What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector based on a deep convolutional neural network. We show that our technique outperforms simpler baselines based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes and searching within a video for events of interest.

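To make the alignment step concrete, below is a minimal Python sketch of the kind of HMM alignment the abstract describes: hidden states are recipe steps, observations are transcript segments, and transitions are constrained to be monotonic (stay on the current step or advance to the next), decoded with Viterbi. The bag-of-words unigram emission model, the add-alpha smoothing, and the stay probability of 0.7 are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def emission_logprob(step_words, segment_words, vocab_size, alpha=0.1):
    """Log-likelihood of a transcript segment under a recipe step's
    smoothed unigram word distribution (an assumed emission model)."""
    counts = Counter(step_words)
    total = len(step_words) + alpha * vocab_size
    return sum(math.log((counts[w] + alpha) / total) for w in segment_words)

def align(recipe_steps, transcript_segments, p_stay=0.7):
    """Viterbi decoding over a left-to-right HMM: hidden states are recipe
    steps, observations are transcript segments, and each transition either
    stays on the current step or advances to the next one."""
    steps = [s.lower().split() for s in recipe_steps]
    segs = [s.lower().split() for s in transcript_segments]
    vocab_size = len({w for ws in steps + segs for w in ws})
    K, T = len(steps), len(segs)
    log_stay, log_adv = math.log(p_stay), math.log(1.0 - p_stay)

    # V[t][k]: best log-probability of explaining segments 0..t and ending
    # on step k; back[t][k] remembers the predecessor step for backtracing.
    V = [[-math.inf] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    V[0][0] = emission_logprob(steps[0], segs[0], vocab_size)  # start at step 0
    for t in range(1, T):
        for k in range(K):
            stay = V[t - 1][k] + log_stay
            advance = (V[t - 1][k - 1] + log_adv) if k > 0 else -math.inf
            best, back[t][k] = max((stay, k), (advance, k - 1))
            V[t][k] = best + emission_logprob(steps[k], segs[t], vocab_size)

    # Backtrace the most likely recipe step for each transcript segment.
    k = max(range(K), key=lambda j: V[T - 1][j])
    path = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    return list(reversed(path))

if __name__ == "__main__":
    recipe = ["Chop the onions",
              "Fry the onions in butter",
              "Add the tomatoes"]
    transcript = ["first chop up your onions",
                  "now fry them in a little butter",
                  "fry until golden brown",
                  "then add the tomatoes"]
    print(align(recipe, transcript))  # -> [0, 1, 1, 2]
```

The monotonicity constraint encodes the prior that a cook narrates steps in recipe order; allowing skips of more than one step would be a natural extension for videos that reorder or omit recipe steps.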