Voxelized 3D Feature Aggregation for Multiview Detection

Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, and state-of-the-art approaches adopt homography transformations to project multi-view features onto the ground plane. However, we find that these 2D transformations do not take the object's height into account. As a result, features along the vertical direction of the same object are likely not projected onto the same ground-plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. Additionally, because different kinds of objects (e.g., humans vs. cattle) have different shapes on the ground plane, we introduce an oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multi-view 2D detection and multi-view 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is competitive with state-of-the-art approaches. Code and MultiviewC are released at https://github.com/Robert-Mar/VFA.
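The voxel projection and vertical aggregation described above can be illustrated with a minimal sketch. It is not the released VFA implementation: the toy camera matrix `P`, the helper names `project_point` and `aggregate_column`, the nearest-pixel sampling, the scalar feature map, and the use of averaging as the aggregation operator are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def project_point(P: Matrix, xyz: Tuple[float, float, float]) -> Optional[Tuple[float, float]]:
    """Project a 3D world point to pixel (u, v) with a 3x4 pinhole camera matrix."""
    x, y, z = xyz
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None  # point lies behind the camera
    return (u / w, v / w)


def aggregate_column(P: Matrix, feature_map: List[List[float]],
                     ground_xy: Tuple[float, float],
                     heights: Sequence[float]) -> float:
    """Average the 2D features hit by the voxels stacked above one ground cell.

    Assumption: nearest-pixel sampling and mean pooling stand in for whatever
    sampling and aggregation the actual method uses.
    """
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    samples = []
    for z in heights:
        uv = project_point(P, (ground_xy[0], ground_xy[1], z))
        if uv is None:
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        if 0 <= row < n_rows and 0 <= col < n_cols:
            samples.append(feature_map[row][col])
    return sum(samples) / len(samples) if samples else 0.0


# Hypothetical camera: world Z is up, the camera looks along +Y from
# (0, -10, 1), focal length 100 px, principal point (50, 50).
P = [
    [100.0, 50.0, 0.0, 500.0],
    [0.0, 50.0, -100.0, 600.0],
    [0.0, 1.0, 0.0, 10.0],
]

# Toy 100x100 single-channel feature map with responses along the image of
# one vertical line (feet, torso, head of an object standing at the origin).
feature_map = [[0.0] * 100 for _ in range(100)]
for row in (60, 50, 40):
    feature_map[row][50] = 1.0

heights = [0.0, 1.0, 2.0]  # voxel centres along the vertical direction
occupied = aggregate_column(P, feature_map, (0.0, 0.0), heights)
empty = aggregate_column(P, feature_map, (1.0, 0.0), heights)
print(occupied, empty)
```

In this toy setup, every voxel above the occupied ground cell lands on a feature response, while a neighbouring cell collects none, so features along one vertical line end up at the same ground-plane location.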
