Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization

In this work, we consider the problem of weakly-supervised multi-step localization in instructional videos. An established approach to this problem is to rely on a given list of steps. However, in reality, there is often more than one way to execute a procedure successfully, by following the set of steps in slightly varying orders. Thus, for successful localization in a given video, recent works require the actual order of procedure steps in the video, to be provided by human annotators at both training and test times. Instead, here, we only rely on generic procedural text that is not tied to a specific video. We represent the various ways to complete the procedure by transforming the list of instructions into a procedure flow graph which captures the partial order of steps. Using the flow graphs reduces both training and test time annotation requirements. To this end, we introduce the new problem of flow graph to video grounding. In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video. To solve this problem, we propose a new algorithm - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them. To show the advantage of our proposed formulation, we extend the CrossTask dataset with procedure flow graph information. Our experiments show that Graph2Vid is both more efficient than the baselines and yields strong step localization results, without the need for step order annotation.

[1]  Jiebo Luo,et al.  Procedure Planning in Instructional Videos via Contextual Modeling and Model-based Policy Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Allan D. Jepson,et al.  Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers , 2021, NeurIPS.

[3]  Rohit Girdhar,et al.  Anticipative Video Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Allan D. Jepson,et al.  Representation Learning via Global Temporal Alignment and Cycle-Consistency , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Greg Mori,et al.  Learning Discriminative Prototypes with Dynamic Time Warping , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  C. Schmid,et al.  Just Ask: Learning to Answer Questions from Millions of Narrated Videos , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Yoko Yamakata,et al.  English Recipe Flow Graph Corpus , 2020, LREC.

[8]  Xilin Chen,et al.  UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.

[9]  Andrew Zisserman,et al.  End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Juan Carlos Niebles,et al.  Procedure Planning in Instructional Videos , 2019, ECCV.

[11]  Juan Carlos Niebles,et al.  Few-Shot Video Classification via Temporal Alignment , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ivan Laptev,et al.  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Ivan Laptev,et al.  Cross-Task Weakly Supervised Learning From Instructional Videos , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Yansong Tang,et al.  COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Chirag Jain,et al.  On the Complexity of Sequence to Graph Alignment , 2019, bioRxiv.

[16]  Juan Carlos Niebles,et al.  D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Fadime Sener,et al.  Zero-Shot Anticipation for Instructional Activities , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Juergen Gall,et al.  NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Veli Mäkinen,et al.  Bit-parallel sequence-to-graph alignment , 2018, bioRxiv.

[20]  Chenliang Xu,et al.  Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Sergey Levine,et al.  Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[22]  Naveen Sivadasan,et al.  Sequence Alignment on Directed Graphs , 2017, bioRxiv.

[23]  Marco Cuturi,et al.  Soft-DTW: a Differentiable Loss Function for Time-Series , 2017, ICML.

[24]  Chenliang Xu,et al.  Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.

[25]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.

[26]  Kris M. Kitani,et al.  Going Deeper into First-Person Activity Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yejin Choi,et al.  Mise en Place: Unsupervised Interpretation of Instructional Recipes , 2015, EMNLP.

[28]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[30]  Ralph Bergmann,et al.  Extraction of procedural knowledge from the web: a comparison of two workflow extraction approaches , 2012, WWW.

[31]  Meinard Müller,et al.  Information retrieval for music and motion , 2007 .

[32]  Christos Faloutsos,et al.  Stream Monitoring under the Time Warping Distance , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[33]  Christopher J. Lee,et al.  Multiple sequence alignment using partial order graphs , 2002, Bioinform..

[34]  Gonzalo Navarro,et al.  Improved approximate pattern matching on hypertext , 1998, Theor. Comput. Sci..

[35]  Yahiko Kambayashi,et al.  A longest common subsequence algorithm suitable for similar text strings , 1982, Acta Informatica.

[36]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[37]  Alexander Koller,et al.  Aligning Actions Across Recipe Graphs , 2021, EMNLP.

[38]  Sanguthevar Rajasekaran,et al.  DTWNet: a Dynamic Time Warping Network , 2019, NeurIPS.

[39]  C. Weibel,et al.  An Introduction to Homological Algebra: References , 1960 .

[40]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .