Cross-Task Weakly Supervised Learning From Instructional Videos

In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: ``pour egg'' should be trained jointly with other tasks involving ``pour'' and ``egg''. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit systematic studying of sharing and so we also gather a new dataset aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that our component model can parse previously unseen tasks by virtue of its compositionality.

[1]  Ali Farhadi,et al.  Commonly Uncommon: Semantic Sparsity in Situation Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Armand Joulin,et al.  Unsupervised Learning by Predicting Noise , 2017, ICML.

[3]  Cordelia Schmid,et al.  Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.

[4]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[5]  Dale Schuurmans,et al.  Maximum Margin Clustering , 2004, NIPS.

[6]  Silvio Savarese,et al.  Unsupervised Semantic Parsing of Video Collections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[9]  Dima Damen,et al.  You-Do, I-Learn: Discovering Task Relevant Objects and their Modes of Interaction from Multi-User Egocentric Video , 2014, BMVC.

[10]  Aren Jansen,et al.  CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[12]  Ivan Laptev,et al.  Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ivan Laptev,et al.  Joint Discovery of Object States and Manipulation Actions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[14]  Chenliang Xu,et al.  Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.

[15]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Silvio Savarese,et al.  Demo2Vec: Reasoning Object Affordances from Online Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Martial Hebert,et al.  From Red Wine to Red Tomato: Composition with Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Cordelia Schmid,et al.  Weakly-Supervised Alignment of Video with Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Juan Carlos Niebles,et al.  Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Fadime Sener,et al.  Unsupervised Learning and Segmentation of Complex Activities from Video , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Jitendra Malik,et al.  From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[24]  Ivan Laptev,et al.  Learning from Narrated Instruction Videos , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[26]  Andrew Zisserman,et al.  Learning Visual Attributes , 2007, NIPS.

[27]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[28]  Juergen Gall,et al.  Action Sets: Weakly Supervised Action Segmentation Without Ordering Constraints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Juergen Gall,et al.  Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[32]  Zaïd Harchaoui,et al.  DIFFRAC: a discriminative and flexible framework for clustering , 2007, NIPS.

[33]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[34]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[35]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[36]  Kevin Murphy,et al.  What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision , 2015, NAACL.

[37]  Juan Carlos Niebles,et al.  Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.