Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport. In particular, we incorporate a temporal regularization term which preserves the temporal order of the activity into the standard optimal transport module for computing pseudolabel cluster assignments. The temporal optimal transport module enables our approach to learn effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e. 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par or better than previous methods for unsupervised activity segmentation, despite having significantly less memory constraints.

[1]  Sergey Levine,et al.  Time-Contrastive Networks: Self-Supervised Learning from Video , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[2]  Abhinav Gupta,et al.  ClusterFit: Improving Generalization of Visual Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Sid Ying-Ze Bao,et al.  Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Dacheng Tao,et al.  Self-Supervised Representation Learning by Rotation Feature Decoupling , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Fadime Sener,et al.  Unsupervised Learning and Segmentation of Complex Activities from Video , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Joost van de Weijer,et al.  Leveraging Unlabeled Data for Crowd Counting by Learning to Rank , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Shih-Fu Chang,et al.  CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Chen Sun,et al.  DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks , 2018, ArXiv.

[10]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Julien Mairal,et al.  Unsupervised Pre-Training of Image Features on Non-Curated Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Julien Mairal,et al.  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[14]  Sridha Sridharan,et al.  Predicting the Future: A Jointly Learnt Model for Action Anticipation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Mubarak Shah,et al.  Real-World Anomaly Detection in Surveillance Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16]  Matthijs Douze,et al.  Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[17]  Chengxu Zhuang,et al.  Local Aggregation for Unsupervised Learning of Visual Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Andrew Zisserman,et al.  Video Representation Learning by Dense Predictive Coding , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[19]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[20]  Hilde Kuehne,et al.  A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Sinisa Todorovic,et al.  Set-Constrained Viterbi for Set-Supervised Action Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Martial Hebert,et al.  Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification , 2016, ECCV.

[23]  Gregory D. Hager,et al.  Deep Supervision with Intermediate Concepts , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Matthieu Cord,et al.  Learning Representations by Predicting Bags of Visual Words , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Gregory Shakhnarovich,et al.  Colorization as a Proxy Task for Visual Understanding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Efstratios Gavves,et al.  Self-Supervised Video Representation Learning with Odd-One-Out Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Yazan Abu Farha,et al.  MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Sinisa Todorovic,et al.  Action Shuffle Alternating Learning for Unsupervised Action Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[32]  Yoshua Bengio,et al.  Slow, Decorrelated Features for Pretraining Complex Cell-like Networks , 2009, NIPS.

[33]  Mubarak Shah,et al.  Unsupervised Discriminative Embedding For Sub-Action Learning in Complex Activities , 2021, 2021 IEEE International Conference on Image Processing (ICIP).

[34]  Fadime Sener,et al.  Unsupervised Learning of Action Classes With Continuous Temporal Embedding , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Yueting Zhuang,et al.  Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Björn Ommer,et al.  CliqueCNN: Deep Unsupervised Exemplar Learning , 2016, NIPS.

[37]  Jun Li,et al.  Weakly Supervised Energy-Based Learning for Action Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  In-So Kweon,et al.  Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles , 2018, AAAI.

[39]  Chenliang Xu,et al.  Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Gregory Shakhnarovich,et al.  Learning Representations for Automatic Colorization , 2016, ECCV.

[41]  Paolo Favaro,et al.  Representation Learning by Learning to Count , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Juergen Gall,et al.  SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Shenghuo Zhu,et al.  Deep Learning of Invariant Features via Simulated Fixations in Video , 2012, NIPS.

[44]  Jonathan Tompson,et al.  Temporal Cycle-Consistency Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Thomas Serre,et al.  An end-to-end generative framework for video segmentation and recognition , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[46]  Juergen Gall,et al.  Action Sets: Weakly Supervised Action Segmentation Without Ordering Constraints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[48]  Gregory D. Hager,et al.  Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Hilde Kuehne,et al.  Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[50]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  Ivan Laptev,et al.  Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Geoffrey E. Hinton,et al.  Autoencoders, Minimum Description Length and Helmholtz Free Energy , 1993, NIPS.

[54]  Loong Fah Cheong,et al.  Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[55]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[56]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Ali Farhadi,et al.  Unsupervised Deep Embedding for Clustering Analysis , 2015, ICML.

[58]  Juergen Gall,et al.  Temporal Action Detection Using a Statistical Language Model , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Luc Van Gool,et al.  DynamoNet: Dynamic Action and Motion Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[60]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[61]  Andrea Vedaldi,et al.  Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[62]  Paul Vernaza,et al.  Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences , 2018, ECCV.

[63]  Huseyin Coskun,et al.  Learning by Aligning Videos in Time , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[65]  Fabio Maria Carlucci,et al.  Domain Generalization by Solving Jigsaw Puzzles , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[66]  Will Y. Zou Unsupervised learning of visual invariance with temporal coherence , 2011 .

[67]  M. Zeeshan Zia,et al.  Towards Anomaly Detection in Dashcam Videos , 2020, 2020 IEEE Intelligent Vehicles Symposium (IV).

[68]  Gang Hua,et al.  Order-Preserving Wasserstein Distance for Sequence Matching , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Runhao Zeng,et al.  Graph Convolutional Networks for Temporal Action Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[70]  In-So Kweon,et al.  Learning Image Representations by Completing Damaged Jigsaw Puzzles , 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[71]  Svetha Venkatesh,et al.  Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[72]  Gaurav Sharma,et al.  Shuffle and Attend: Video Domain Adaptation , 2020, ECCV.

[73]  Shaogang Gong,et al.  Unsupervised Deep Learning by Neighbourhood Discovery , 2019, ICML.

[74]  Ming-Hsuan Yang,et al.  Unsupervised Representation Learning by Sorting Sequences , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[75]  Silvio Savarese,et al.  Unsupervised Semantic Parsing of Video Collections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[76]  Hossein Mobahi,et al.  Deep learning from temporal coherence in video , 2009, ICML '09.

[77]  Rahul Sukthankar,et al.  Rethinking the Faster R-CNN Architecture for Temporal Action Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[78]  Yazan Abu Farha,et al.  Temporal Action Segmentation from Timestamp Supervision , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80]  Kihyuk Sohn,et al.  Improved Deep Metric Learning with Multi-class N-pair Loss Objective , 2016, NIPS.

[81]  Dhruv Batra,et al.  Joint Unsupervised Learning of Deep Representations and Image Clusters , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Gregory D. Hager,et al.  Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation , 2016, ECCV.

[83]  Jonathan Tompson,et al.  Unsupervised Learning of Spatiotemporally Coherent Metrics , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[84]  Sudeep Sarkar,et al.  A Perceptual Prediction Framework for Self Supervised Event Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Juergen Gall,et al.  Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[86]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[87]  Juergen Gall,et al.  NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[88]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[89]  Kevin Murphy,et al.  What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision , 2015, NAACL.

[90]  Juan Carlos Niebles,et al.  D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[91]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[92]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.