论文信息 - t-EVA: Time-Efficient t-SNE Video Annotation

t-EVA: Time-Efficient t-SNE Video Annotation

Video understanding has received more attention in the past few years due to the availability of several large-scale video datasets. However, annotating large-scale video datasets are cost-intensive. In this work, we propose a time-efficient video annotation method using spatio-temporal feature similarity and t-SNE dimensionality reduction to speed up the annotation process massively. Placing the same actions from different videos near each other in the two-dimensional space based on feature similarity helps the annotator to group-label video clips. We evaluate our method on two subsets of the ActivityNet (v1.3) and a subset of the Sports-1M dataset. We show that t-EVA can outperform other video annotation tools while maintaining test accuracy on video classification.

[1] Gang Wang,et al. Joint learning of visual attributes, object classes and visual saliency , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[2] Qiang Wang,et al. Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Wu Liu,et al. T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition , 2018, AAAI.

[4] Lorenzo Torresani,et al. C3D: Generic Features for Video Analysis , 2014, ArXiv.

[5] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[6] Abhishek Dutta,et al. The VIA Annotation Software for Images, Audio and Video , 2019, ACM Multimedia.

[7] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Luc Van Gool,et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[9] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Yutaka Satoh,et al. Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[11] Abhinav Gupta,et al. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Elmar Eisemann,et al. GPGPU Linear Complexity t-SNE Optimization , 2018, IEEE Transactions on Visualization and Computer Graphics.

[13] Yi Yang,et al. A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15] Arnold W. M. Smeulders,et al. Timeception for Complex Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Shih-Fu Chang,et al. Action Temporal Localization in Untrimmed Videos via Multi-stage CNNs , 2016, ArXiv.

[17] Andreas Kerren,et al. t-viSNE: Interactive Assessment and Interpretation of t-SNE Projections , 2020, IEEE Transactions on Visualization and Computer Graphics.

[18] Michel Verleysen,et al. Nonlinear Dimensionality Reduction , 2021, Computer Vision.

[19] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20] Martin Wattenberg,et al. How to Use t-SNE Effectively , 2016 .

[21] Gaël Varoquaux,et al. Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[22] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23] Michael J. Black,et al. On the Integration of Optical Flow and Action Recognition , 2017, GCPR.

[24] Heng Wang,et al. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[26] Stefan Steinerberger,et al. Clustering with t-SNE, provably , 2017, SIAM J. Math. Data Sci..

[27] Mubarak Shah,et al. A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[28] Francois P. S. Luus,et al. Active Learning with TensorBoard Projector , 2019, ArXiv.

[29] Ming Zeng,et al. Semi-supervised convolutional neural networks for human activity recognition , 2017, 2017 IEEE International Conference on Big Data (Big Data).

[30] Luc Van Gool,et al. Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification , 2017, ArXiv.

[31] Cordelia Schmid,et al. A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[32] Chen Sun,et al. DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks , 2018, ArXiv.

[33] Lorenzo Torresani,et al. DistInit: Learning Video Representations Without a Single Labeled Video , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Wei Wu,et al. Distractor-aware Siamese Networks for Visual Object Tracking , 2018, ECCV.

[35] James Ze Wang,et al. Real-Time Computerized Annotation of Pictures , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[37] Leland McInnes,et al. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction , 2018, ArXiv.

[38] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[39] Davide Modolo,et al. Action Recognition With Spatial-Temporal Discriminative Filter Banks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[41] Jiajun Wu,et al. Deep multiple instance learning for image classification and auto-annotation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).