Weakly supervised discriminative localization and classification: a joint learning process

Visual categorization problems, such as object classification or action recognition, are increasingly approached using a detection strategy: a classifier function is first applied to candidate subwindows of the image or the video, and the maximum classifier score is then used for the class decision. Traditionally, the subwindow classifiers are trained on a large collection of examples manually annotated with masks or bounding boxes. The reliance on time-consuming human labeling effectively limits the application of these methods to problems involving very few categories. Furthermore, the human selection of the masks introduces arbitrary biases (e.g., in terms of window size and location) which may be suboptimal for classification. In this paper we propose a novel method for learning a discriminative subwindow classifier from examples annotated with binary labels indicating the presence of an object or action of interest, but not its location. During training, our approach simultaneously localizes the instances of the positive class and learns a subwindow SVM to recognize them. We extend our method to the classification of time series by presenting an algorithm that localizes the most discriminative set of temporal segments in the signal. We evaluate our approach on several datasets for object and action recognition and show that it achieves results similar to, and in many cases superior to, those obtained with full supervision.
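The detection strategy described in the abstract (score every candidate subwindow, then decide the class from the maximum score) can be illustrated with a minimal one-dimensional sketch. This is not the authors' algorithm: it assumes a linear classifier whose window score is the sum of per-frame scores, considers only a single contiguous window rather than a set of segments, and searches exhaustively. The names `best_window`, `classify`, `point_scores`, and `bias` are hypothetical and introduced only for illustration.

```python
from itertools import accumulate


def best_window(point_scores, min_len=1, max_len=None):
    """Return (score, start, end) for the highest-scoring contiguous window.

    Assumption: the window score is the sum of per-frame scores, as it
    would be for a linear classifier on additive features. `end` is
    exclusive. Ties are resolved in favor of the first window found.
    """
    n = len(point_scores)
    max_len = n if max_len is None else min(max_len, n)
    if n == 0 or min_len > max_len:
        raise ValueError("no candidate window satisfies the length limits")

    # Prefix sums let each window be scored in constant time.
    prefix = [0.0] + list(accumulate(point_scores))

    best = None
    for start in range(n):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > n:
                break
            score = prefix[end] - prefix[start]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


def classify(point_scores, bias=0.0, **limits):
    """Decide the class from the maximum subwindow score plus a bias."""
    score, start, end = best_window(point_scores, **limits)
    decision = score + bias
    return decision > 0, (start, end), decision


if __name__ == "__main__":
    frame_scores = [-1.0, 2.0, 3.0, -4.0, 1.0]
    print(best_window(frame_scores))
    print(classify(frame_scores, bias=-6.0))
```

In a weakly supervised setting, a search of this kind would alternate with classifier updates, so that the windows chosen on positive examples become the training instances for the next round; the sketch covers only the search and decision step.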
