MegDet: A Large Mini-Batch Object Detector

The development of object detection in the era of deep learning, from R-CNN [11], Fast/Faster R-CNN [10, 31] to recent Mask R-CNN [14] and RetinaNet [24], mainly come from novel network, new framework, or loss design. However, mini-batch size, a key factor for the training of deep neural networks, has not been well studied for object detection. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable the training with a large minibatch size up to 256, so that we can effectively utilize at most 128 GPUs to significantly shorten the training time. Technically, we suggest a warmup learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours), and achieve even better accuracy. The MegDet is the backbone of our submission (mmAP 52.5%) to COCO 2017 Challenge, where we won the 1st place of Detection task.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[3]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[4]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[5]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[7]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[8]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9]  Alex Krizhevsky,et al.  One weird trick for parallelizing convolutional neural networks , 2014, ArXiv.

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[13]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Nikos Komodakis,et al.  Object Detection via a Multi-region and Semantic Segmentation-Aware CNN Model , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Abhinav Gupta,et al.  Contextual Priming and Feedback for Faster R-CNN , 2016, ECCV.

[18]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[19]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[21]  Tianqi Chen,et al.  Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.

[22]  Kavita Bala,et al.  Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[24]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[26]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Elad Hoffer,et al.  Train longer, generalize better: closing the generalization gap in large batch training of neural networks , 2017, NIPS.

[30]  Sergio Guadarrama,et al.  Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[32]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yuning Jiang,et al.  What Can Help Pedestrian Detection? , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jian Dong,et al.  Attentive Contexts for Object Detection , 2016, IEEE Transactions on Multimedia.

[35]  Jian Sun,et al.  Object Detection Networks on Convolutional Feature Maps , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  James Demmel,et al.  ImageNet Training in Minutes , 2017, ICPP.

[38]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[40]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.