Multimodal Object Detection via Probabilistic Ensembling

Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter provide much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is a probabilistic ensembling technique, ProbEn, a simple non-learned method that fuses detections from multiple modalities. We derive ProbEn from Bayes’ rule and first principles that assume conditional independence across modalities. Through probabilistic marginalization, ProbEn elegantly handles missing modalities when detectors do not fire on the same object. Importantly, ProbEn also notably improves multimodal detection even when the conditional independence assumption does not hold, e.g., when fusing outputs from other fusion methods (both off-the-shelf and trained in-house). We validate ProbEn on two benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal images, showing that ProbEn outperforms prior work by more than 13% in relative performance!

... performs quite well on both Day and Night; average score fusion performs poorly because it double counts the class prior. As for box fusion, using the learned variance (uncertainty) in v-avg performs better than the heuristic methods (avg and s-avg). Our ProbEn performs significantly better, and ProbEn3 is the best, fusing three models: RGB, Thermal, and MidFusion.

Image pairs are split into a train-set (8,862 images) and a validation set (1,366 images). FLIR evaluates on three classes with imbalanced examples [8,30,60,41,12]: 28,151 persons, 46,692 cars, and 4,457 bicycles. Following [60], we remove 108 thermal images in the val-set that do not have RGB counterparts. For breakdown analysis w.r.t. day/night scenes, we manually tag the validation images with “day” (768) and “night” (490). We will release our annotations to the public.
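
The score and box fusion strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform default prior, and the inverse-variance weighting used for v-avg are assumptions made for illustration. Under conditional independence, Bayes' rule gives p(y | x1, x2) ∝ p(y | x1) p(y | x2) / p(y), so the fused score is a normalized product of per-modality posteriors with the class prior divided out once per extra modality. A modality that did not fire on the object is marginalized out, which here amounts to leaving it out of the product.

```python
import numpy as np


def proben_score_fusion(posteriors, prior=None):
    """Fuse class posteriors p(y|x_i) assuming conditional independence.

    posteriors: list of 1-D arrays, one per modality that fired on the
                object; a missing modality is simply omitted from the list.
    prior:      class prior p(y); ASSUMED uniform when not given.
    """
    posteriors = [np.asarray(p, dtype=float) for p in posteriors]
    num_classes = posteriors[0].shape[0]
    if prior is None:
        prior = np.full(num_classes, 1.0 / num_classes)
    prior = np.asarray(prior, dtype=float)

    eps = 1e-12  # guards against log(0)
    # log p(y|x_1..x_M) = sum_i log p(y|x_i) - (M - 1) log p(y) + const
    log_fused = sum(np.log(p + eps) for p in posteriors)
    log_fused = log_fused - (len(posteriors) - 1) * np.log(prior + eps)
    fused = np.exp(log_fused - log_fused.max())  # stabilize before normalizing
    return fused / fused.sum()


def average_score_fusion(posteriors):
    """Baseline: plain average of the per-modality posteriors."""
    return np.mean(np.asarray(posteriors, dtype=float), axis=0)


def variance_weighted_box_fusion(boxes, variances):
    """Illustrative v-avg: average boxes weighted by inverse predicted variance.

    boxes:     (M, 4) array of [x1, y1, x2, y2], one row per modality.
    variances: (M,) or (M, 4) array of predicted localization variances.
    """
    boxes = np.asarray(boxes, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    if weights.ndim == 1:
        weights = weights[:, None]  # one weight per box, shared by all coords
    return (weights * boxes).sum(axis=0) / weights.sum(axis=0)


# Hypothetical scores over [person, background] from two detectors.
rgb = [0.7, 0.3]
thermal = [0.8, 0.2]
fused = proben_score_fusion([rgb, thermal])
averaged = average_score_fusion([rgb, thermal])
only_thermal = proben_score_fusion([thermal])  # RGB detector did not fire
fused_box = variance_weighted_box_fusion(
    [[0, 0, 10, 10], [2, 2, 12, 12]], variances=[1.0, 3.0]
)
```

In this toy case the two detectors agree, so the probabilistic fusion yields a person score above either input, while averaging lands between them. With a single firing detector the fused score is that detector's own posterior.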

[1]  Jiwon Kim,et al.  MLPD: Multi-Label Pedestrian Detector in Multispectral Domain , 2021, IEEE Robotics and Automation Letters.

[2]  Marco Bertini,et al.  Bottom-up and Layerwise Domain Adaptation for Pedestrian Detection in Thermal Images , 2021, ACM Trans. Multim. Comput. Commun. Appl.

[3]  Abhinav Valada,et al.  There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Heng ZHANG,et al.  Guided Attentive Feature Fusion for Multispectral Pedestrian Detection , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[5]  Sedat Ozer,et al.  SyNet: An Ensemble Network for Object Detection in UAV Images , 2020, 2020 25th International Conference on Pattern Recognition (ICPR).

[6]  Roman Solovyev,et al.  Weighted boxes fusion: Ensembling boxes from different object detection models , 2021, Image Vis. Comput.

[7]  Heng Zhang,et al.  Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks , 2020, 2020 IEEE International Conference on Image Processing (ICIP).

[8]  Xun Cao,et al.  Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems , 2020, ECCV.

[9]  Zehui Chen,et al.  1st Place Solutions of Waymo Open Dataset Challenge 2020 - 2D Object Detection Track , 2020, ArXiv.

[10]  Shoaib Azam,et al.  Thermal Object Detection using Domain Adaptation through Style Consistency , 2020, ArXiv.

[11]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[13]  Alberto Del Bimbo,et al.  Task-Conditioned Domain Adaptation for Pedestrian Detection in Thermal Imagery , 2020, ECCV.

[14]  Xinhua Zhu,et al.  Every Feature Counts: An Improved One-Stage Detector in Thermal Imagery , 2019, 2019 IEEE 5th International Conference on Computer and Communications (ICCC).

[15]  Yuan Feng,et al.  2nd Place Solution in Google AI Open Images Object Detection Track 2019 , 2019, ArXiv.

[16]  Hong Qiao,et al.  Cross-modality interactive attention network for multispectral pedestrian detection , 2019, Inf. Fusion.

[17]  Vineeth N Balasubramanian,et al.  Borrow From Anywhere: Pseudo Multi-Modal Object Detection in Thermal Imagery , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[18]  Yong Jae Lee,et al.  YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Xiangyu Zhu,et al.  Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Chengyang Li,et al.  Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection , 2018, Pattern Recognit.

[21]  Michael Ying Yang,et al.  Fusion of Multispectral Data Through Illumination-aware Deep Neural Networks for Pedestrian Detection , 2018, Inf. Fusion.

[22]  Takuya Akiba,et al.  PFDet: 2nd Place Solution to Open Images Challenge 2018 Object Detection Track , 2018, ArXiv.

[23]  Chengyang Li,et al.  Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation , 2018, BMVC.

[24]  Hang Zhang,et al.  Multi-style Generative Network for Real-time Transfer , 2017, ECCV Workshops.

[25]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[26]  拓海 杉山,et al.  Learning report on “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks” , 2017 .

[27]  Heiko Neumann,et al.  Fully Convolutional Region Proposal Networks for Multispectral Person Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[28]  Kilian Q. Weinberger,et al.  On Calibration of Modern Neural Networks , 2017, ICML.

[29]  Bernt Schiele,et al.  Learning Non-maximum Suppression , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Larry S. Davis,et al.  Soft-NMS — Improving Object Detection with One Line of Code , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Nicu Sebe,et al.  Learning Cross-Modal Deep Representations for Robust Pedestrian Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Shu Wang,et al.  Multispectral Deep Neural Networks for Pedestrian Detection , 2016, BMVC.

[34]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[36]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Sven Behnke,et al.  Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks , 2016, ESANN.

[38]  Kihong Park,et al.  Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[39]  Namil Kim,et al.  Multispectral pedestrian detection: Benchmark dataset and baseline , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[42]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[44]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[45]  Thierry Denoeux,et al.  Evidential combination of pedestrian detectors , 2014, BMVC.

[46]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[47]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[48]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[49]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[50]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Morgan Quigley,et al.  ROS: an open-source Robot Operating System , 2009, ICRA 2009.

[53]  B. Schiele,et al.  Pedestrian detection: A benchmark , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[55]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[56]  Thomas G. Dietterich.  Ensemble Methods in Machine Learning , 2000, Multiple Classifier Systems.

[57]  Josef Kittler,et al.  Combining classifiers , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[58]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[59]  A. Weigend,et al.  Estimating the mean and variance of the target probability distribution , 1994, Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94).

[60]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[61]  A. Dawid.  Conditional Independence in Statistical Theory , 1979 .