Multiple Instance Reinforcement Learning for Efficient Weakly-Supervised Detection in Images

State-of-the-art visual recognition and detection systems increasingly rely on large amounts of training data and complex classifiers. Therefore it becomes increasingly expensive both to manually annotate datasets and to keep running times at levels acceptable for practical applications. In this paper, we propose two solutions to address these issues. First, we introduce a weakly supervised, segmentation-based approach to learn accurate detectors and image classifiers from weak supervisory signals that provide only approximate constraints on target localization. We illustrate our system on the problem of action detection in static images (Pascal VOC Actions 2012), using human visual search patterns as our training signal. Second, inspired from the saccade-and-fixate operating principle of the human visual system, we use reinforcement learning techniques to train efficient search models for detection. Our sequential method is weakly supervised and general (it does not require eye movements), finds optimal search strategies for any given detection confidence function and achieves performance similar to exhaustive sliding window search at a fraction of its computational cost.

[1]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Yichen Wei,et al.  Efficient histogram-based sliding window , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4]  R. J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[5]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[6]  Joachim M. Buhmann,et al.  Weakly supervised structured output learning for semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Michael Dorr,et al.  Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements , 2012, ECCV.

[8]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cristian Sminchisescu,et al.  Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths , 2013, NIPS.

[10]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Frank Keller,et al.  Training Object Class Detectors from Eye Tracking Data , 2014, ECCV.

[12]  Iasonas Kokkinos,et al.  Shufflets: Shared Mid-level Parts for Fast Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[14]  James M. Rehg,et al.  Learning to Recognize Daily Actions Using Gaze , 2012, ECCV.

[15]  Kate Saenko,et al.  Confidence-Rated Multiple Instance Boosting for Object Detection , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Razvan C. Bunescu,et al.  Multiple instance learning for sparse positive bags , 2007, ICML '07.

[17]  Krista A. Ehinger,et al.  Modelling search for people in 900 scenes: A combined source model of eye guidance , 2009 .

[18]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[19]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[20]  Jaume Amores,et al.  Multiple instance classification: Review, taxonomy and comparative study , 2013, Artif. Intell..

[21]  Greg Mori,et al.  Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization , 2013, NIPS.

[22]  Harish Katti,et al.  An Eye Fixation Database for Saliency Detection in Images , 2010, ECCV.

[23]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[24]  Thomas Hofmann,et al.  Support Vector Machines for Multiple-Instance Learning , 2002, NIPS.

[25]  Andrew G. Barto,et al.  Reinforcement learning , 1998 .

[26]  Javier R. Movellan,et al.  Infomax Control of Eye Movements , 2010, IEEE Transactions on Autonomous Mental Development.

[27]  Cristian Sminchisescu,et al.  Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition , 2012, ECCV.

[28]  Cristian Sminchisescu,et al.  CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Richard S. Sutton,et al.  Dimensions of Reinforcement Learning , 1998 .

[30]  Christof Koch,et al.  Predicting human gaze using low-level saliency combined with face detection , 2007, NIPS.