Scale-Aware Squeeze-and-Excitation for Lightweight Object Detection

Lightweight object detection enables intelligent robots to recognize surrounding objects with limited computational resources, and it is therefore receiving increasing attention in the robotics community. High-resolution networks (HRNets) learn high-resolution representations and achieve excellent performance as backbones of current cutting-edge object detectors. However, two crucial issues remain when applying HRNet-based detectors to mobile devices: insufficient local feature interaction and insufficient multiscale feature fusion. In this work, we propose a scale-aware squeeze-and-excitation (SASE) module that uses squeeze-and-excitation (SE) operations to fully explore feature interactions without increasing network complexity; this is followed by a scale-aware attention (SAA) mechanism, which adaptively fuses multiscale features by estimating the importance of each scale. The SASE module can serve as the basic block of HRNet, which facilitates the use of HRNet as a backbone for lightweight object detection. Extensive experiments on Microsoft COCO and Pascal VOC demonstrate that the proposed method achieves a good tradeoff between accuracy and model complexity. With a similar number of parameters and a similar computational cost, the mean average precision (mAP) achieved on the COCO dataset is improved by 3.7% over that of Lite-HRNet.
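The two ideas named in the abstract, SE-style channel gating and per-scale importance weighting, can be illustrated with a minimal NumPy sketch. This is not the SASE module itself, whose design the abstract does not specify. The function names, the nearest-neighbour resizing, the softmax over scales, and the scoring vector `v` are all assumptions made for illustration; only the generic SE pattern (global average pool, bottleneck, sigmoid gate) follows the standard formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(x, w1, w2):
    """Generic SE gating on a (C, H, W) feature map.

    w1: (C // r, C) and w2: (C, C // r) are bottleneck weights
    (hypothetical parameters, normally learned).
    """
    z = x.mean(axis=(1, 2))          # squeeze: one descriptor per channel
    h = np.maximum(w1 @ z, 0.0)      # bottleneck + ReLU
    g = sigmoid(w2 @ h)              # per-channel gate in (0, 1)
    return x * g[:, None, None]      # excite: rescale channels

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) map (assumed alignment step)."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[:, rows][:, :, cols]

def scale_aware_fuse(features, v):
    """Fuse multiscale maps with one softmax weight per scale.

    v: (C,) hypothetical scoring vector that maps each scale's pooled
    descriptor to a scalar importance score.
    """
    out_h, out_w = features[0].shape[1:]
    aligned = [resize_nearest(f, out_h, out_w) for f in features]
    scores = np.array([v @ f.mean(axis=(1, 2)) for f in aligned])
    scores = scores - scores.max()   # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = sum(w * f for w, f in zip(weights, aligned))
    return fused, weights

# Toy usage with random weights and three resolutions.
rng = np.random.default_rng(0)
C, r = 8, 4
feats = [rng.standard_normal((C, s, s)) for s in (16, 8, 4)]
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
v = rng.standard_normal(C)
gated = [squeeze_excite(f, w1, w2) for f in feats]
fused, weights = scale_aware_fuse(gated, v)
```

In this sketch the fused map takes the resolution of the first (highest-resolution) input, and the per-scale weights sum to one, so a scale judged unimportant contributes proportionally less to the output.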
