Multi-Modal Pedestrian Detection with Large Misalignment Based on Modal-Wise Regression and Multi-Modal IoU

The combined use of multiple modalities enables accurate pedestrian detection under poor lighting conditions by using the high visibility areas from these modalities together. The vital assumption for the combination use is that there is no or only a weak misalignment between the two modalities. In general, however, this assumption often breaks in actual situations. Due to this assumption's breakdown, the position of the bounding boxes does not match between the two modalities, resulting in a significant decrease in detection accuracy, especially in regions where the amount of misalignment is large. In this paper, we propose a multimodal Faster-RCNN that is robust against large misalignment. The keys are 1) modal-wise regression and 2) multi-modal IoU for mini-batch sampling. To deal with large misalignment, we perform bounding box regression for both the RPN and detection-head with both modalities. We also propose a new sampling strategy called “multi-modal mini-batch sampling” that integrates the IoU for both modalities. We demonstrate that the proposed method's performance is much better than that of the state-of-the-art methods for data with large misalignment through actual image experiments.

[1]  Zhen Lei,et al.  Weakly Aligned Feature Fusion for Multimodal Object Detection , 2021, IEEE transactions on neural networks and learning systems.

[2]  Heng ZHANG,et al.  Guided Attentive Feature Fusion for Multispectral Pedestrian Detection , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[3]  Xun Cao,et al.  Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems , 2020, ECCV.

[4]  Hong Qiao,et al.  Cross-modality interactive attention network for multispectral pedestrian detection , 2019, Inf. Fusion.

[5]  Alexander Carballo,et al.  A Survey of Autonomous Driving: Common Practices and Emerging Technologies , 2019, IEEE Access.

[6]  Masatoshi Okutomi,et al.  Full thermal panorama from a long wavelength infrared and visible camera system , 2019, J. Electronic Imaging.

[7]  Xiangyu Zhu,et al.  Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Masatoshi Okutomi,et al.  DISPARITY MAP ESTIMATION FROM CROSS-MODAL STEREO , 2018, 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[9]  Chengyang Li,et al.  Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation , 2018, BMVC.

[10]  Byron Boots,et al.  Learning to Align Images Using Weak Geometric Supervision , 2018, 2018 International Conference on 3D Vision (3DV).

[11]  Kihong Park,et al.  Unified multi-spectral pedestrian detection based on probabilistic fusion networks , 2018, Pattern Recognit..

[12]  Shifeng Zhang,et al.  Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd , 2018, ECCV.

[13]  Chengyang Li,et al.  Illumination-aware Faster R-CNN for Robust Multispectral Pedestrian Detection , 2018, Pattern Recognit..

[14]  Michael Ying Yang,et al.  Fusion of Multispectral Data Through Illumination-aware Deep Neural Networks for Pedestrian Detection , 2018, Inf. Fusion.

[15]  In So Kweon,et al.  KAIST Multi-Spectral Day/Night Data Set for Autonomous and Assisted Driving , 2018, IEEE Transactions on Intelligent Transportation Systems.

[16]  Masatoshi Okutomi,et al.  Misalignment-Robust Joint Filter for Cross-Modal Image Pairs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Heiko Neumann,et al.  Fully Convolutional Region Proposal Networks for Multispectral Person Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[18]  Scott Sorensen,et al.  CATS: A Color and Thermal Stereo Benchmark , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yuning Jiang,et al.  What Can Help Pedestrian Detection? , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Masatoshi Okutomi,et al.  Coaxial visible and FIR camera system with accurate geometric calibration , 2017, Commercial + Scientific Sensing and Imaging.

[21]  Nicu Sebe,et al.  Learning Cross-Modal Deep Representations for Robust Pedestrian Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[23]  Masatoshi Okutomi,et al.  Accurate Joint Geometric Camera Calibration of Visible and Far-Infrared Cameras , 2017, IMSE.

[24]  Shu Wang,et al.  Multispectral Deep Neural Networks for Pedestrian Detection , 2016, BMVC.

[25]  Liang Lin,et al.  Is Faster R-CNN Doing Well for Pedestrian Detection? , 2016, ECCV.

[26]  Jiaolong Xu,et al.  Pedestrian Detection at Day/Night Time with Visible and FIR Cameras: A Comparison , 2016, Sensors.

[27]  Shuicheng Yan,et al.  Scale-Aware Fast R-CNN for Pedestrian Detection , 2015, IEEE Transactions on Multimedia.

[28]  Namil Kim,et al.  Multispectral pedestrian detection: Benchmark dataset and baseline , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Minh N. Do,et al.  DASC: Dense adaptive self-correlation descriptor for multi-modal and multi-spectral correspondence , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[32]  Bin Yang,et al.  Convolutional Channel Features , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[34]  Pietro Perona,et al.  Fast Feature Pyramids for Object Detection , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[36]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Shutao Li,et al.  Image Fusion With Guided Filtering , 2013, IEEE Transactions on Image Processing.

[38]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[39]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Christopher K. I. Williams,et al.  International Journal of Computer Vision manuscript No. (will be inserted by the editor) The PASCAL Visual Object Classes (VOC) Challenge , 2022 .

[41]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  B. Schiele,et al.  Pedestrian detection: A benchmark , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  N. Dalal,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[44]  Kihong Park,et al.  Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks , 2016, 2016 23rd International Conference on Pattern Recognition (ICPR).

[45]  Sven Behnke,et al.  Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks , 2016, ESANN.

[46]  M. Savitha,et al.  Image Fusion with Guided Filtering , 2014 .

[47]  Pietro Perona,et al.  Integral Channel Features , 2009, BMVC.

[48]  A. Krizhevsky ImageNet Classification with Deep Convolutional Neural Networks , 2022 .