YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors

Real-time object detection is one of the most important research topics in computer vision. As new approaches to architecture optimization and training optimization are continually being developed, we have identified two research topics that arise when working with these latest state-of-the-art methods. To address them, we propose a trainable bag-of-freebies oriented solution that combines flexible and efficient training tools with the proposed architecture and compound scaling method. YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 120 FPS, and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU. Source code is released at https://github.com/WongKinYiu/yolov7.
