Retro-Actions: Learning 'Close' by Time-Reversing 'Open' Videos

We investigate video transforms that result in class-homogeneous label transforms, i.e. transforms that consistently maintain or modify the labels of all videos in each class. We propose a general approach to discover invariant classes, whose transformed examples maintain their label; pairs of equivariant classes, whose transformed examples exchange their labels; and novel-generating classes, whose transformed examples belong to a new class outside the dataset. Label transforms offer additional supervision, previously unexplored in video recognition, that benefits data augmentation and enables zero-shot learning by training a class from transformed videos of its counterpart. Amongst such video transforms, we study horizontal flipping, time reversal, and their composition. We highlight errors in naively using horizontal flipping as a form of data augmentation in video. Next, we validate the realism of time-reversed videos through a human perception study in which people exhibit equal preference for forward and time-reversed videos. Finally, we test our approach on two datasets, Jester and Something-Something, evaluating the three video transforms for zero-shot learning and data augmentation. Our results show that gestures such as 'zooming in' can be learnt from 'zooming out' in a zero-shot setting, as can more complex actions with state transitions such as 'digging something out of something' from 'burying something in something'.
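To make the pairing of video transforms and label transforms concrete, the following is a minimal sketch, assuming clips are stored as (T, H, W, C) arrays; the class names and the label map are illustrative placeholders, not the discovered class lists from the paper.

```python
import numpy as np

def time_reverse(video):
    """Reverse the temporal order of frames; video has shape (T, H, W, C)."""
    return video[::-1]

def horizontal_flip(video):
    """Flip every frame left-right along the width axis."""
    return video[:, :, ::-1]

def flip_and_reverse(video):
    """Composition of horizontal flipping and time reversal."""
    return horizontal_flip(time_reverse(video))

# Illustrative label transform for time reversal: equivariant classes map to
# their counterpart, invariant classes map to themselves, and classes absent
# from the map are treated as novel-generating (no in-dataset label).
TIME_REVERSAL_LABEL_MAP = {
    "zooming in": "zooming out",     # equivariant pair
    "zooming out": "zooming in",
    "shaking hand": "shaking hand",  # invariant class
}

def reversed_example(video, label):
    """Derive an extra supervised example from a time-reversed clip,
    returning None when the transformed clip has no in-dataset label."""
    new_label = TIME_REVERSAL_LABEL_MAP.get(label)
    if new_label is None:
        return None
    return time_reverse(video), new_label
```

Under this scheme, equivariant pairs provide zero-shot training data (e.g. learning 'zooming in' from reversed 'zooming out' clips), while invariant classes provide label-preserving augmentation.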
