Paper information - UnweaveNet: Unweaving Activity Stories

UnweaveNet: Unweaving Activity Stories

Our lives can be seen as a complex weaving of activities; we switch from one activity to another, to maximise our achievements or in reaction to demands placed upon us. Observing a video of unscripted daily activities, we parse the video into its constituent activity threads through a process we call unweaving. To accomplish this, we introduce a video representation called a thread bank, which explicitly captures activity threads, along with a neural controller capable of detecting goal changes and the resumption of past activities; together these form UnweaveNet. We train and evaluate UnweaveNet on sequences from the unscripted egocentric dataset EPIC-KITCHENS. We propose and showcase the efficacy of pretraining UnweaveNet in a self-supervised manner.
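As a rough illustration of what unweaving means, the sketch below assigns each clip of a video either to an existing activity thread or to a new one. It is not the authors' model: the abstract describes a learned neural controller, whereas this toy version substitutes a fixed cosine-similarity threshold. The names `unweave`, `thread_bank`, and `threshold`, and the use of small hand-written feature vectors, are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unweave(clips: Sequence[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Return, for each clip, the index of the thread it is assigned to.

    Assumption: a similarity threshold stands in for the learned controller.
    """
    thread_bank: List[List[int]] = []  # each thread holds the indices of its clips
    assignment: List[int] = []
    for i, feat in enumerate(clips):
        best_thread, best_sim = None, threshold
        # Compare the clip with the most recent clip of every stored thread.
        for t, thread in enumerate(thread_bank):
            sim = cosine(feat, clips[thread[-1]])
            if sim >= best_sim:
                best_thread, best_sim = t, sim
        if best_thread is None:
            # No thread is similar enough: treat this as a goal change.
            thread_bank.append([i])
            assignment.append(len(thread_bank) - 1)
        else:
            # Continue (or resume) the best-matching thread.
            thread_bank[best_thread].append(i)
            assignment.append(best_thread)
    return assignment


if __name__ == "__main__":
    # Two interleaved toy "activities" in a 2-D feature space.
    clips = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (1.0, 0.05), (0.1, 0.9)]
    print(unweave(clips))
```

In this toy input the fourth clip returns to the first activity after an interruption, which is the "resuming" case the abstract mentions; a learned controller would make that decision from video features rather than from a hand-set threshold.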

Dima Damen | Will Price | Carl Vondrick
