Paper Information - Procedure Completion by Learning from Partial Summaries

We address the problem of procedure completion in videos: finding and localizing all key-steps of a task given only a small observed subset of key-steps. We cast the problem as learning summarization from partial summaries, which allows us to incorporate prior knowledge and learn from incomplete key-steps. Given multiple pairs of (video, subset of key-steps), we learn representations of the input data and find the remaining key-steps in a way that generalizes well to key-step discovery in new videos. We propose a loss function on the parameters of a network that promotes recovering unseen key-steps which, together with the observed key-steps, optimize a desired subset selection criterion. To tackle the highly non-convex learning problem, which involves both discrete and continuous variables, we develop an efficient learning algorithm that alternates between representation learning and recovering unseen key-steps via a greedy algorithm, while incorporating prior knowledge. Through extensive experiments on two instructional video datasets, we show the effectiveness of our framework.
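
The greedy recovery of unseen key-steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which subset selection criterion is used, so this example assumes a facility-location style cost (the sum of each segment's dissimilarity to its nearest selected segment). It also assumes the pairwise dissimilarities are given and fixed, so the representation learning half of the alternation is left out. The names `facility_cost` and `complete_summary` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set


def facility_cost(dist: Sequence[Sequence[float]], selected: Set[int]) -> float:
    """Assumed criterion: total dissimilarity of every item to its nearest selected item."""
    return sum(min(row[j] for j in selected) for row in dist)


def complete_summary(
    dist: Sequence[Sequence[float]],
    observed: Sequence[int],
    budget: int,
) -> List[int]:
    """Greedily add items to the observed subset until `budget` items are selected.

    Returns only the newly added indices, in the order they were chosen.
    """
    selected = set(observed)
    added: List[int] = []
    n = len(dist)
    while len(selected) < budget:
        candidates = [j for j in range(n) if j not in selected]
        if not candidates:
            break
        # Pick the candidate whose addition yields the lowest assumed cost.
        best = min(candidates, key=lambda j: facility_cost(dist, selected | {j}))
        selected.add(best)
        added.append(best)
    return added


# Toy data: seven segments embedded on a line; index 0 is the observed key-step.
points = [0, 1, 10, 11, 12, 13, 30]
dist = [[abs(a - b) for b in points] for a in points]
print(complete_summary(dist, observed=[0], budget=3))
```

In this toy run the observed index stays fixed and the greedy step only chooses the remaining representatives, which mirrors the idea of completing a partial summary rather than building one from scratch.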

Ehsan Elhamifar | Zwe Naing
