Multimodal Object Detection via Bayesian Fusion

Object detection with multimodal inputs can improve many safety-critical perception systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is a non-learned late-fusion method that fuses together bounding box detections from different modalities via a simple probabilistic model derived from first principles. Our simple approach, which we call Bayesian Fusion, is readily derived from conditional independence assumptions across different modalities. We apply our approach to benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal sensor data. Our Bayesian Fusion outperforms prior work by more than 13% in relative performance.

[1]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Vineeth N Balasubramanian,et al.  Borrow From Anywhere: Pseudo Multi-Modal Object Detection in Thermal Imagery , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[3]  Alberto Del Bimbo,et al.  Task-Conditioned Domain Adaptation for Pedestrian Detection in Thermal Imagery , 2020, ECCV.

[4]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[5]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[6]  Shoaib Azam,et al.  Thermal Object Detection using Domain Adaptation through Style Consistency , 2020, ArXiv.

[7]  Xinhua Zhu,et al.  Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery , 2019, 2019 IEEE 5th International Conference on Computer and Communications (ICCC).

[8]  Hang Zhang,et al.  Multi-style Generative Network for Real-time Transfer , 2017, ECCV Workshops.

[9]  Yong Jae Lee,et al.  YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Chengyang Li,et al.  Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection , 2018, Pattern Recognit..

[11]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[12]  Michael Ying Yang,et al.  Fusion of Multispectral Data Through Illumination-aware Deep Neural Networks for Pedestrian Detection , 2018, Inf. Fusion.

[13]  Shu Wang,et al.  Multispectral Deep Neural Networks for Pedestrian Detection , 2016, BMVC.

[14]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[15]  Kilian Q. Weinberger,et al.  On Calibration of Modern Neural Networks , 2017, ICML.

[16]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[17]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[18]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  M. Degroot Optimal Statistical Decisions , 1970 .

[20]  Nicu Sebe,et al.  Learning Cross-Modal Deep Representations for Robust Pedestrian Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.

[22]  Xiangyu Zhu,et al.  Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[24]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Morgan Quigley,et al.  ROS: an open-source Robot Operating System , 2009, ICRA 2009.

[27]  Sven Behnke,et al.  Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks , 2016, ESANN.

[28]  Heiko Neumann,et al.  Fully Convolutional Region Proposal Networks for Multispectral Person Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[29]  Namil Kim,et al.  Multispectral pedestrian detection: Benchmark dataset and baseline , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[32]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Marco Bertini,et al.  Bottom-up and Layerwise Domain Adaptation for Pedestrian Detection in Thermal Images , 2021, ACM Trans. Multim. Comput. Commun. Appl..

[34]  Alexei A. Efros,et al.  Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Bernt Schiele,et al.  Learning Non-maximum Suppression , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Chengyang Li,et al.  Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation , 2018, BMVC.

[38]  Hong Qiao,et al.  Cross-modality interactive attention network for multispectral pedestrian detection , 2019, Inf. Fusion.

[39]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[40]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  A. Dawid Conditional Independence in Statistical Theory , 1979 .

[42]  Kihong Park,et al.  Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[43]  Heng Zhang,et al.  Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks , 2020, 2020 IEEE International Conference on Image Processing (ICIP).