Weakly supervised easy-to-hard learning for object detection in image sequences

Abstract: Object detection is an important research problem in computer vision. Deep learning models based on Convolutional Neural Networks (CNNs) can be used for this problem, but they require a large number of manually annotated objects for training or fine-tuning. Unfortunately, fine-grained manual annotations are not available in many cases. It is usually possible, however, to obtain imperfect initial detections from weak object detectors that rely on weak supervision, such as prior knowledge of shape, size, or motion. In some real-world applications, objects exhibit little inter-occlusion and few split/merge difficulties, so spatio-temporal consistency in object tracking is well preserved across the image sequences/videos. Starting from the imperfect initialization, this paper proposes a new easy-to-hard learning method that incrementally improves object detection in image sequences/videos through an unsupervised spatio-temporal analysis, which brings in more complex examples that are hard to detect for the next training iteration. The proposed method does not require manual annotations; instead, it uses weak supervision and spatio-temporal consistency in tracking to simulate the supervision in CNN training. Experimental results on three different tasks show significant improvements over the initial detections produced by the weak object detectors.
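
The idea of using spatio-temporal consistency to recover hard examples can be illustrated with a toy sketch. This is not the paper's algorithm: the function names (`iou`, `mine_hard_examples`), the three-frame window, the box averaging, and the IoU threshold of 0.5 are all assumptions chosen for illustration. The sketch proposes a pseudo-label in a frame when an object is detected in the frames before and after it but missed in the frame itself.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mine_hard_examples(frames: List[List[Box]], thresh: float = 0.5) -> List[List[Box]]:
    """For each interior frame, propose boxes that the detector missed.

    A box is proposed in frame t when a detection in frame t-1 overlaps a
    detection in frame t+1 (IoU >= thresh) but the averaged box overlaps
    no detection in frame t. The threshold is an assumed value.
    """
    mined: List[List[Box]] = [[] for _ in frames]
    for t in range(1, len(frames) - 1):
        for prev in frames[t - 1]:
            for nxt in frames[t + 1]:
                if iou(prev, nxt) < thresh:
                    continue  # not the same object under this toy criterion
                # Average the neighbouring boxes as a guess for frame t.
                mid = tuple((p + n) / 2 for p, n in zip(prev, nxt))
                if all(iou(mid, cur) < thresh for cur in frames[t]):
                    mined[t].append(mid)  # missed in frame t: a "hard" example
    return mined


# The object is detected in frames 0 and 2 but missed in frame 1.
detections = [[(0, 0, 10, 10)], [], [(2, 0, 12, 10)]]
print(mine_hard_examples(detections))
```

In an easy-to-hard loop of the kind the abstract describes, boxes mined this way would be added to the initial detections as extra training examples for the next iteration, and the retrained detector would then be run again on the sequence.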
