Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation

Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our code is available.1

[1]  Sergio Escalera,et al.  Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey , 2017, Gesture Recognition.

[2]  Mehrtash Tafazzoli Harandi,et al.  Going deeper into action recognition: A survey , 2016, Image Vis. Comput..

[3]  Hilde Kuehne,et al.  Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[4]  Ivan Laptev,et al.  Unsupervised Learning from Narrated Instruction Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Juergen Gall,et al.  Fast Weakly Supervised Action Segmentation Using Mutual Consistency , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[7]  Thomas Serre,et al.  An end-to-end generative framework for video segmentation and recognition , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[8]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[9]  Bernt Schiele,et al.  Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data , 2015, International Journal of Computer Vision.

[10]  C. L. Mallows NON-NULL RANKING MODELS. I , 1957 .

[11]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Juergen Gall,et al.  Action Sets: Weakly Supervised Action Segmentation Without Ordering Constraints , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Bingbing Ni,et al.  Multiple Granularity Analysis for Fine-Grained Action Detection , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Juergen Gall,et al.  SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Cordelia Schmid,et al.  Weakly Supervised Action Labeling in Videos under Ordering Constraints , 2014, ECCV.

[16]  Jun Li,et al.  Weakly Supervised Energy-Based Learning for Action Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Hilde Kuehne,et al.  A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yazan Abu Farha,et al.  MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[21]  Chenliang Xu,et al.  Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Sid Ying-Ze Bao,et al.  Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Fadime Sener,et al.  Unsupervised Learning and Segmentation of Complex Activities from Video , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[26]  Chenliang Xu,et al.  TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation , 2017, ArXiv.

[27]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Sudeep Sarkar,et al.  A Perceptual Prediction Framework for Self Supervised Event Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Juergen Gall,et al.  Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[30]  Juergen Gall,et al.  NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Juan Carlos Niebles,et al.  D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Juan Carlos Niebles,et al.  Connectionist Temporal Modeling for Weakly Supervised Action Labeling , 2016, ECCV.

[35]  Fadime Sener,et al.  Unsupervised Learning of Action Classes With Continuous Temporal Embedding , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Gregory D. Hager,et al.  Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation , 2016, ECCV.

[37]  Bernt Schiele,et al.  A database for fine grained activity detection of cooking activities , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  M. Saquib Sarfraz,et al.  Efficient Parameter-Free Clustering Using First Neighbor Relations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).