Active Learning of an Action Detector from Untrimmed Videos

Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful annotations from a large set of unlabeled videos. Predicting the utility of annotating an unlabeled video is not trivial, since a clip may contain multiple actions of interest and need not be trimmed to the temporal regions of interest. To address this, we propose a detection-based active learner to train action category models. We develop a voting-based framework that localizes likely intervals of interest in an unlabeled clip and uses them to estimate the total reduction in uncertainty that annotating the clip would yield. On three datasets, we show that our approach learns accurate action detectors more efficiently than alternative active learning strategies that fail to account for the "untrimmed" nature of real video data.
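
The selection criterion described above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it scores each untrimmed clip by the uncertainty its annotation would remove, summed over candidate intervals proposed by a localization step. The names `localize`, `detector_prob`, and the entropy-based reduction estimate are assumptions introduced here purely for illustration.

```python
import numpy as np

def entropy(p):
    """Binary entropy of a detector's posterior probability p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def clip_utility(candidate_intervals, detector_prob):
    """
    Estimate how much annotating one untrimmed clip would reduce the
    current detector's uncertainty, summed over the clip's candidate
    intervals (hypothetical representation: one feature vector per
    interval proposed by a voting-based localizer).
    """
    total_reduction = 0.0
    for x in candidate_intervals:
        p = detector_prob(x)          # current detector's posterior for this interval
        # Optimistic assumption: once the annotator labels this interval,
        # its residual uncertainty drops to zero, so the reduction equals
        # the current entropy.
        total_reduction += entropy(p)
    return total_reduction

def select_clip_to_annotate(unlabeled_clips, localize, detector_prob):
    """
    Pick the untrimmed clip whose annotation is estimated to be most useful.

    unlabeled_clips : iterable of clip identifiers.
    localize        : callable returning candidate action intervals for a clip
                      (stands in for the voting-based localization step).
    detector_prob   : callable mapping an interval's features to the current
                      detector's posterior probability of the action.
    """
    scores = {c: clip_utility(localize(c), detector_prob) for c in unlabeled_clips}
    return max(scores, key=scores.get)
```

In this sketch the clip-level score is simply the sum of per-interval entropies, which favors long clips with many uncertain candidate detections; any refinement (e.g., weighting intervals by localization confidence) would be a further design choice not shown here.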
