Paper information - Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called RepNet, with a synthetic dataset that is generated from a large unlabeled video collection by sampling short clips of varying lengths and repeating them with different periods and counts. This combination of synthetic data and a powerful yet constrained model allows us to predict periods in a class-agnostic fashion. Our model substantially exceeds the state-of-the-art performance on existing periodicity (PERTUBE) and repetition counting (QUVA) benchmarks. We also collect a new challenging dataset called Countix (~90 times larger than existing datasets) which captures the challenges of repetition counting in real-world videos. Project webpage: https://sites.google.com/view/repnet
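The two ideas in the abstract, synthetic repetitions and a temporal self-similarity representation, can be illustrated with a small sketch. This is not the RepNet implementation. The function names `make_synthetic_repetition` and `self_similarity`, the toy 4-dimensional frame embeddings, and the similarity function (negative squared Euclidean distance followed by a row-wise softmax) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def make_synthetic_repetition(frames, start, period, count):
    """Repeat the clip frames[start:start + period] `count` times.

    Returns the repeated sequence and a per-frame period label.
    Hypothetical helper: the abstract only says clips are sampled
    and repeated with different periods and counts.
    """
    if period <= 0 or count <= 0:
        raise ValueError("period and count must be positive")
    if start < 0 or start + period > len(frames):
        raise ValueError("clip does not fit inside the source video")
    clip = frames[start:start + period]
    repeated = clip * count
    labels = [period] * len(repeated)
    return repeated, labels


def self_similarity(embeddings, temperature=1.0):
    """Temporal self-similarity matrix over per-frame embeddings.

    Assumed similarity: negative squared Euclidean distance,
    normalized with a row-wise softmax.
    """
    matrix = []
    for row_vec in embeddings:
        logits = []
        for col_vec in embeddings:
            d2 = sum((a - b) ** 2 for a, b in zip(row_vec, col_vec))
            logits.append(-d2 / temperature)
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "video": 20 frames, each a 4-dimensional embedding
    # (a stand-in for learned per-frame features).
    video = [[rng.random() for _ in range(4)] for _ in range(20)]
    repeated, labels = make_synthetic_repetition(video, start=3, period=5, count=4)
    tsm = self_similarity(repeated)
    # Frames exactly one period apart are identical in this toy setup,
    # so each row of the matrix repeats with the same period.
    print(len(repeated), labels[0], abs(tsm[0][0] - tsm[0][5]) < 1e-12)
```

In this toy setting each row of the matrix depends only on pairwise frame similarity, not on what the frames depict, which is one way to read the abstract's claim that the bottleneck supports class-agnostic period prediction.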
