Learning to Filter Object Detections

Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.

[1]  Pietro Perona,et al.  Pedestrian Detection: An Evaluation of the State of the Art , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[3]  Peter Kontschieder,et al.  Evolutionary Hough Games for coherent object detection , 2012, Comput. Vis. Image Underst..

[4]  Pushmeet Kohli,et al.  On Detection of Multiple Object Instances Using Hough Transforms , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[6]  Bernt Schiele,et al.  A Convnet for Non-maximum Suppression , 2015, GCPR.

[7]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Rogério Schmidt Feris,et al.  A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection , 2016, ECCV.

[9]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[10]  Li Wan,et al.  End-to-end integration of a Convolutional Network, Deformable Parts Model and non-maximum suppression , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrew Y. Ng,et al.  End-to-End People Detection in Crowded Scenes , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[13]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Vittorio Ferrari,et al.  End-to-End Training of Object Class Detectors for Mean Average Precision , 2016, ACCV.

[15]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Matthieu Guillaumin,et al.  Non-maximum Suppression for Object Detection by Passing Messages Between Windows , 2014, ACCV.

[18]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.