论文信息 - PyTorchVideo: A Deep Learning Library for Video Understanding

PyTorchVideo: A Deep Learning Library for Video Understanding

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.

[1] Dima Damen,et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[2] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[4] Kaiming He,et al. A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[7] Yann LeCun,et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8] Michal Valko,et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[9] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[10] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[11] Heng Wang,et al. Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12] Yong Jae Lee,et al. Audiovisual SlowFast Networks for Video Recognition , 2020, ArXiv.

[13] Mario F. M. Campos,et al. A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14] Cordelia Schmid,et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15] Quoc V. Le,et al. Randaugment: Practical automated data augmentation with a reduced search space , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[16] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] He He,et al. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing , 2020, J. Mach. Learn. Res..

[18] Christoph Feichtenhofer,et al. X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Balaji Lakshminarayanan,et al. AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty , 2020, ICLR.

[20] Susanne Westphal,et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.