Action Segmentation with Mixed Temporal Domain Adaptation

The main progress for action segmentation comes from densely-annotated data for fully-supervised learning. Since manual annotation for frame-level actions is time-consuming and challenging, we propose to exploit auxiliary unlabeled videos, which are much easier to obtain, by shaping this problem as a domain adaptation (DA) problem. Although various DA techniques have been proposed in recent years, most of them have been developed only for the spatial direction. Therefore, we propose Mixed Temporal Domain Adaptation (MTDA) to jointly align frame-and video-level embedded feature spaces across domains, and further integrate with the domain attention mechanism to focus on aligning the frame-level features with higher domain discrepancy, leading to more effective domain adaptation. Finally, we evaluate our proposed methods on three challenging datasets (GTEA, 50Salads, and Breakfast), and validate that MTDA outperforms the current state-of-the-art methods on all three datasets by large margins (e.g. 6.4% gain on F1@50 and 6.8% gain on the edit score for GTEA).

[1]  Qilong Wang,et al.  Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Gabriela Csurka,et al.  A Comprehensive Survey on Domain Adaptation for Visual Applications , 2017, Domain Adaptation in Computer Vision Applications.

[3]  Ming Shao,et al.  A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Michael I. Jordan,et al.  Deep Transfer Learning with Joint Adaptation Networks , 2016, ICML.

[5]  Jianmin Wang,et al.  Transferable Attention for Domain Adaptation , 2019, AAAI.

[6]  Jiaying Liu,et al.  Adaptive Batch Normalization for practical domain adaptation , 2018, Pattern Recognit..

[7]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Sinisa Todorovic,et al.  Temporal Deformable Residual Networks for Action Segmentation in Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Gregory D. Hager,et al.  Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation , 2016, ECCV.

[10]  Thomas Serre,et al.  An end-to-end generative framework for video segmentation and recognition , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[11]  K. S. Venkatesh,et al.  Deep Domain Adaptation in Action Space , 2018, BMVC.

[12]  Michael I. Jordan,et al.  Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[13]  Juergen Gall,et al.  Weakly supervised learning of actions from transcripts , 2016, Comput. Vis. Image Underst..

[14]  Gregory D. Hager,et al.  Temporal Convolutional Networks for Action Segmentation and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Edwin Lughofer,et al.  Central Moment Discrepancy (CMD) for Domain-Invariant Representation Learning , 2017, ICLR.

[16]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Juergen Gall,et al.  Temporal Action Detection Using a Statistical Language Model , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[19]  Ruxin Chen,et al.  Temporal Attentive Alignment for Large-Scale Video Domain Adaptation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Tatsuya Harada,et al.  Maximum Classifier Discrepancy for Unsupervised Domain Adaptation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[23]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[24]  Victor S. Lempitsky,et al.  Unsupervised Domain Adaptation by Backpropagation , 2014, ICML.

[25]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Imran Saleemi,et al.  Human Action Recognition across Datasets by Foreground-Weighted Histogram Decomposition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Asim Kadav,et al.  Attend and Interact: Higher-Order Object Interactions for Video Understanding , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  James M. Rehg,et al.  Learning to recognize objects in egocentric activities , 2011, CVPR 2011.

[29]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Michael I. Jordan,et al.  Unsupervised Domain Adaptation with Residual Transfer Networks , 2016, NIPS.

[31]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[32]  Kate Saenko,et al.  Deep CORAL: Correlation Alignment for Deep Domain Adaptation , 2016, ECCV Workshops.

[33]  Jiaying Liu,et al.  Revisiting Batch Normalization For Practical Domain Adaptation , 2016, ICLR.

[34]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[35]  Yazan Abu Farha,et al.  MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[37]  Dong Xu,et al.  Collaborative and Adversarial Network for Unsupervised Domain Adaptation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Thomas Serre,et al.  The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Edward K. Wong,et al.  Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition , 2016, Image Vis. Comput..

[41]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[42]  Sridha Sridharan,et al.  Coupled Generative Adversarial Network for Continuous Fine-Grained Action Segmentation , 2019, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[43]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[44]  Chenliang Xu,et al.  Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Chenliang Xu,et al.  TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation , 2017, ArXiv.