A Novel Dictionary Learning based Multiple Instance Learning Approach to Action Recognition from Videos

In this paper, we address the problem of action recognition from unconstrained videos within the multiple instance learning (MIL) framework. The traditional MIL paradigm treats data items as bags of instances, with the constraint that positive bags contain some class-specific instances whereas negative bags consist only of instances from negative classes. A classifier is then constructed using the bag-level annotations and a distance metric between the bags. However, such an approach is not robust to outliers and is time-consuming even for a moderately large dataset. In contrast, we propose a dictionary learning based strategy for MIL which first identifies class-specific discriminative codewords and then projects the bag-level instances into a probabilistic embedding space with respect to the selected codewords. This yields a fixed-length vector representation of each bag that is dominated by the properties of the class-specific instances. We introduce a novel exhaustive search strategy using a support vector machine classifier to identify the class-specific codewords. The standard multiclass classification pipeline is then followed in the new embedded feature space to perform action recognition. We validate the proposed framework on the challenging KTH and Weizmann datasets; the results obtained are promising and comparable to those of representative techniques from the literature.
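The embedding step can be illustrated with a minimal sketch. The abstract does not specify the exact form of the probabilistic embedding, so the following rests on assumptions: each instance is softly assigned to the codewords through a softmax over negative squared Euclidean distances, and the per-instance assignments are averaged over the bag. The function name `embed_bag`, the sharpness parameter `beta`, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def embed_bag(instances, codewords, beta=1.0):
    """Map a bag of instances to a fixed-length vector, one entry per codeword.

    Assumed form (not from the paper): softmax over negative squared
    distances per instance, followed by average pooling over the bag.
    """
    if not instances:
        raise ValueError("a bag must contain at least one instance")
    k = len(codewords)
    pooled = [0.0] * k
    for x in instances:
        # Squared Euclidean distance from this instance to every codeword.
        sq_dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codewords]
        # Shift by the smallest distance so the largest exponent is exp(0).
        d_min = min(sq_dists)
        weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
        total = sum(weights)
        # Accumulate this instance's soft assignment to each codeword.
        for j in range(k):
            pooled[j] += weights[j] / total
    # Average pooling: the output length depends only on the codebook size.
    return [p / len(instances) for p in pooled]


# Toy 2-D codebook with three hypothetical class-specific codewords.
codewords = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]

# Bags of different sizes; bag_b also contains one background instance.
bag_a = [(0.1, -0.2), (0.0, 0.3), (-0.1, 0.1)]
bag_b = [(4.9, 5.1), (5.2, 4.8), (5.0, 5.0), (4.7, 5.3), (0.2, 0.1)]

vec_a = embed_bag(bag_a, codewords)
vec_b = embed_bag(bag_b, codewords)
```

Both bags map to vectors of length three despite having different numbers of instances, so a standard multiclass classifier can be trained on them directly. Average pooling is only one option; max pooling over instances would emphasise the single most class-specific instance in each bag.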
