Scaling up Kernels in 3D CNNs

Recent advances in 2D CNNs and vision transformers (ViTs) reveal that large kernels are essential for enough receptive fields and high performance. Inspired by this literature, we examine the feasibility and challenges of 3D large-kernel designs. We demonstrate that applying large convolutional kernels in 3D CNNs has more difficulties in both performance and efficiency. Existing techniques that work well in 2D CNNs are ineffective in 3D networks, including the popular depth-wise convolutions. To overcome these obstacles, we present the spatial-wise group convolution and its large-kernel module (SW-LK block). It avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN network, i.e. , LargeKernel3D, yields non-trivial improvements on various 3D tasks, including semantic segmentation and object detection. Notably, it achieves 73.9% mIoU on the ScanNetv2 semantic segmentation and 72.8% NDS nuScenes object detection benchmarks, ranking 1 st on the nuScenes LIDAR leaderboard. It is further boosted to 74.2% NDS with a simple multi-modal fusion. LargeKernel3D attains comparable or superior results than its CNN and transformer counterparts. For the first time, we show that large kernels are feasible and essential for 3D networks. Code and models will be available at github.com/dvlab-research/LargeKernel3D.

[1]  Jiaya Jia,et al.  Voxel Field Fusion for 3D Object Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Kaicheng Yu,et al.  BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework , 2022, NeurIPS.

[3]  Huizi Mao,et al.  BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation , 2022, 2023 IEEE International Conference on Robotics and Automation (ICRA).

[4]  Jiaya Jia,et al.  Focal Sparse Convolutional Networks for 3D Object Detection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jiaya Jia,et al.  Stratified Transformer for 3D Point Cloud Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Chiew-Lan Tai,et al.  TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yilun Wang,et al.  FUTR3D: A Unified Sensor Fusion Framework for 3D Detection , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[8]  Jian Sun,et al.  Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  J. Zico Kolter,et al.  Patches Are All You Need? , 2022, Trans. Mach. Learn. Res..

[10]  Trevor Darrell,et al.  A ConvNet for the 2020s , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Hang Zhao,et al.  Embracing Single Stride 3D Object Detector with Sparse Transformer , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Li Dong,et al.  Swin Transformer V2: Scaling Up Capacity and Resolution , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Philipp Krähenbühl,et al.  Multimodal Virtual Point 3D Detection , 2021, NeurIPS.

[14]  Minzhe Niu,et al.  Voxel Transformer for 3D Object Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Minzhe Niu,et al.  Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  A. Dosovitskiy,et al.  Do Vision Transformers See Like Convolutional Neural Networks? , 2021, NeurIPS.

[17]  Nenghai Yu,et al.  CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jiwen Lu,et al.  Global Filter Networks for Image Classification , 2021, NeurIPS.

[19]  Lu Yuan,et al.  Focal Self-attention for Local-Global Interactions in Vision Transformers , 2021, ArXiv.

[20]  Jingdong Wang,et al.  On the Connection between Local Attention and Dynamic Depth-wise Convolution , 2021, ICLR.

[21]  Xiaokang Yang,et al.  PointAugmenting: Cross-Modal Augmentation for 3D Object Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jakub M. Tomczak,et al.  CKConv: Continuous Kernel Convolution For Sequential Data , 2021, ICLR.

[23]  Ningning Ma,et al.  RepVGG: Making VGG-style ConvNets Great Again , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Wengang Zhou,et al.  Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection , 2020, AAAI.

[25]  Gao Huang,et al.  3D Object Detection with Pointformer , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[27]  Philipp Krähenbühl,et al.  Center-based 3D Object Detection and Tracking , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Lei Zhang,et al.  Structure Aware Single-Stage 3D Object Detection From Point Cloud , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Zhuguo Li,et al.  PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yanan Sun,et al.  3DSSD: Point-Based 3D Single Stage Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Xiaogang Wang,et al.  PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  A. Yuille,et al.  Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots , 2019, ECCV.

[33]  A. Markham,et al.  RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Alex H. Lang,et al.  PointPainting: Sequential Fusion for 3D Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Benjin Zhu,et al.  Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection , 2019, ArXiv.

[36]  Xiaogang Wang,et al.  From Points to Parts: 3D Object Detection From Point Cloud With Part-Aware and Part-Aggregation Network , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Leonidas J. Guibas,et al.  Deep Hough Voting for 3D Object Detection in Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Silvio Savarese,et al.  4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Leonidas J. Guibas,et al.  KPConv: Flexible and Deformable Convolution for Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jian Sun,et al.  DetNAS: Backbone Search for Object Detection , 2019, NeurIPS.

[42]  Jiong Yang,et al.  PointPillars: Fast Encoders for Object Detection From Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Xiaogang Wang,et al.  PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Wei Wu,et al.  PointCNN: Convolution On X-Transformed Points , 2018, NeurIPS.

[45]  Fuxin Li,et al.  PointConv: Deep Convolutional Networks on 3D Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Bo Li,et al.  SECOND: Sparsely Embedded Convolutional Detection , 2018, Sensors.

[47]  Hongen Liao,et al.  Efficient Semantic Scene Completion Network with Spatial Group Convolution , 2018, ECCV.

[48]  Laurens van der Maaten,et al.  3D Semantic Segmentation with Submanifold Sparse Convolutional Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Vijay Vasudevan,et al.  Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[51]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[52]  Xiangyu Zhang,et al.  Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Raquel Urtasun,et al.  Understanding the Effective Receptive Field in Deep Convolutional Neural Networks , 2016, NIPS.

[55]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[59]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[60]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[61]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[63]  Feihu Zhang,et al.  Deep FusionNet for Point Cloud Semantic Segmentation , 2020, ECCV.

[64]  Alan L. Yuille,et al.  Every View Counts: Cross-View Consistency in 3D Object Detection with Hybrid-Cylindrical-Spherical Voxelization , 2020, NeurIPS.