VIPriors 3: Visual Inductive Priors for Data-Efficient Deep Learning Challenges

The third edition of the"VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning"workshop featured four data-impaired challenges, focusing on addressing the limitations of data availability in training deep learning models for computer vision tasks. The challenges comprised of four distinct data-impaired tasks, where participants were required to train models from scratch using a reduced number of training samples. The primary objective was to encourage novel approaches that incorporate relevant inductive biases to enhance the data efficiency of deep learning models. To foster creativity and exploration, participants were strictly prohibited from utilizing pre-trained checkpoints and other transfer learning techniques. Significant advancements were made compared to the provided baselines, where winning solutions surpassed the baselines by a considerable margin in all four tasks. These achievements were primarily attributed to the effective utilization of extensive data augmentation policies, model ensembling techniques, and the implementation of data-efficient training methods, including self-supervised representation learning. This report highlights the key aspects of the challenges and their outcomes.

[1]  Hongbin Wang,et al.  Task-Specific Data Augmentation and Inference Processing for VIPriors Instance Segmentation Challenge , 2022, ArXiv.

[2]  Gabriel Van Zandycke,et al.  DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations , 2022, MMSports@MM.

[3]  H. Liao,et al.  YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  J. V. Gemert,et al.  Evaluating Context for Deep Object Detectors , 2022, ArXiv.

[5]  Chang Huang,et al.  Sparse Instance Activation for Real-Time Instance Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Limin Wang,et al.  VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training , 2022, NeurIPS.

[7]  S. L. Phung,et al.  DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jiajun Liang,et al.  Decoupled Knowledge Distillation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ari S. Morcos,et al.  Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time , 2022, ICML.

[10]  Trevor Darrell,et al.  A ConvNet for the 2020s , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Chi-Keung Tang,et al.  Mask Transfiner for High-Quality Instance Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Han Hu,et al.  SimMIM: a Simple Framework for Masked Image Modeling , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jahongir Yunusov,et al.  Instance Segmentation Challenge Track Technical Report, VIPriors Workshop at ICCV 2021: Task-Specific Copy-Paste Data Augmentation Method for Instance Segmentation , 2021, ArXiv.

[14]  Michael S. Bernstein,et al.  On the Opportunities and Risks of Foundation Models , 2021, ArXiv.

[15]  Qijun Zhao,et al.  SSPNet: Scale Selection Pyramid Network for Tiny Person Detection From UAV Images , 2021, IEEE Geoscience and Remote Sensing Letters.

[16]  Haibin Ling,et al.  CBNet: A Composite Backbone Network Architecture for Object Detection , 2021, IEEE Transactions on Image Processing.

[17]  Stephen Lin,et al.  Video Swin Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jan C. van Gemert,et al.  Hallucination In Object Detection — A Study In Visual Part VERIFICATION , 2021, 2021 IEEE International Conference on Image Processing (ICIP).

[19]  Chien-Yao Wang,et al.  You Only Learn One Representation: Unified Network for Multiple Tasks , 2021, J. Inf. Sci. Eng..

[20]  Yutong Lin,et al.  Self-Supervised Learning with Swin Transformers , 2021, ArXiv.

[21]  Saining Xie,et al.  An Empirical Study of Training Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Ekin D. Cubuk,et al.  Revisiting ResNets: Improved Training and Scaling Strategies , 2021, NeurIPS.

[23]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[24]  Xiang Li,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Heng Wang,et al.  Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.

[26]  Ying Wang,et al.  SWA Object Detection , 2020, ArXiv.

[27]  Quoc V. Le,et al.  Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Kai Chen,et al.  Seesaw Loss for Long-Tailed Instance Segmentation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Hong-Yuan Mark Liao,et al.  YOLOv4: Optimal Speed and Accuracy of Object Detection , 2020, ArXiv.

[30]  Christoph Feichtenhofer,et al.  X3D: Expanding Architectures for Efficient Video Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Bolei Zhou,et al.  Temporal Pyramid Network for Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  J. V. Gemert,et al.  On Translation Invariance in CNNs: Convolutional Layers Can Exploit Absolute Spatial Location , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Hao Shao,et al.  Temporal Interlacing Network , 2020, AAAI.

[34]  Hengshuang Zhao,et al.  GridMask Data Augmentation , 2020, ArXiv.

[35]  Xin Zhao,et al.  TANet: Robust 3D Object Detection from Point Clouds with Triple Attention , 2019, AAAI.

[36]  Ross B. Girshick,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Quoc V. Le,et al.  Randaugment: Practical automated data augmentation with a reduced search space , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[38]  Seong Joon Oh,et al.  CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Heng Wang,et al.  Video Classification With Channel-Separated Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Yongchao Gong,et al.  Mask Scoring R-CNN , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Silvio Savarese,et al.  Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Kyunghyun Cho,et al.  Augmentation for small object detection , 2019, 9th International Conference on Advances in Computing and Information Technology (ACITY 2019).

[43]  Zhi Zhang,et al.  Bag of Freebies for Training Object Detection Neural Networks , 2019, ArXiv.

[44]  Kai Chen,et al.  Hybrid Task Cascade for Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Policies from Data , 2018, ArXiv.

[47]  Kaiming He,et al.  Group Normalization , 2018, International Journal of Computer Vision.

[48]  Andrew Gordon Wilson,et al.  Averaging Weights Leads to Wider Optima and Better Generalization , 2018, UAI.

[49]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Hongyi Zhang,et al.  mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[52]  Jinhui Tang,et al.  CAD: Scale Invariant Framework for Real-Time Object Detection , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[53]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Shuicheng Yan,et al.  Dual Path Networks , 2017, NIPS.

[55]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[56]  Larry S. Davis,et al.  Soft-NMS — Improving Object Detection with One Line of Code , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[58]  Serge J. Belongie,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[61]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[65]  Ted Belytschko,et al.  Multi-scale methods , 2000 .

[66]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[67]  Roman Solovyev,et al.  Weighted boxes fusion: Ensembling boxes from different object detection models , 2021, Image Vis. Comput..

[68]  Stan Z. Li,et al.  Unveiling the Power of Mixup for Stronger Classifiers , 2021 .

[69]  Derek Hoiem,et al.  Action Recognition , 2014, Computer Vision, A Reference Guide.