Weakly Supervised Action Detection

Detecting human actions in videos of busy natural scenes with dynamic backgrounds is of interest for applications such as video surveillance. In a conventional fully supervised approach, the spatio-temporal locations of the action of interest have to be manually annotated frame by frame in the training videos, which is tedious and unreliable. In this paper, for the first time, a weakly supervised action detection method is proposed that requires only binary labels indicating whether the action of interest is present in each video. Given a training set of binary-labelled videos, the weakly supervised learning (WSL) problem is recast as a multiple instance learning (MIL) problem. A novel MIL algorithm is developed which differs from existing MIL algorithms in that it locates the action of interest spatially and temporally by globally optimising both inter- and intra-class distance. We demonstrate through experiments that our WSL approach can achieve detection performance comparable to that of a fully supervised learning approach, and that the proposed MIL algorithm significantly outperforms existing ones.
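To make the MIL formulation concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes each video is a "bag" of candidate spatio-temporal regions, each described by a feature vector; that a positive bag contains at least one region showing the action while a negative bag contains none; and that one region per positive bag is chosen so that the chosen regions are close to each other (small intra-class distance) and far from all negative regions (large inter-class distance). The objective (mean intra-class distance minus mean inter-class distance), the Euclidean metric, the function names and the coordinate-descent search are all illustrative assumptions. In particular, coordinate descent is only a local search standing in for the global optimisation described above.

```python
import math
from itertools import combinations


def objective(selected, negatives):
    """Mean intra-class distance minus mean inter-class distance (assumed form)."""
    pairs = list(combinations(selected, 2))
    intra = sum(math.dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    cross = [(s, n) for s in selected for n in negatives]
    inter = sum(math.dist(s, n) for s, n in cross) / len(cross) if cross else 0.0
    return intra - inter


def select_instances(positive_bags, negative_bags, max_sweeps=20):
    """Pick one instance index per positive bag by coordinate descent.

    This is a local search used for illustration only; it is not the
    global optimisation described in the abstract.
    """
    negatives = [x for bag in negative_bags for x in bag]
    choice = [0] * len(positive_bags)  # start from the first instance of each bag

    def score(i, j):
        # Objective value if bag i switches to instance j, others unchanged.
        trial = [
            bag[j if k == i else choice[k]]
            for k, bag in enumerate(positive_bags)
        ]
        return objective(trial, negatives)

    for _ in range(max_sweeps):
        changed = False
        for i, bag in enumerate(positive_bags):
            best = min(range(len(bag)), key=lambda j: score(i, j))
            if best != choice[i]:
                choice[i] = best
                changed = True
        if not changed:  # no bag changed its selection: local optimum reached
            break
    return choice


if __name__ == "__main__":
    # Toy 2-D features: background regions lie near (0, 0), action regions near (5, 5).
    positive_bags = [
        [(0.1, 0.0), (5.0, 5.1), (0.2, 0.3)],
        [(4.9, 5.0), (0.0, 0.2)],
        [(0.3, 0.1), (0.1, 0.1), (5.2, 4.8)],
    ]
    negative_bags = [
        [(0.0, 0.1), (0.2, 0.0)],
        [(0.1, 0.2), (0.3, 0.3)],
    ]
    print(select_instances(positive_bags, negative_bags))
```

In this toy setting the selected index in each positive bag gives the estimated location of the action in that video. A detector could then be trained on the selected regions as if they had been annotated by hand.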
