Improving Human Action Recognition Using Score Distribution and Ranking

We propose two complementary techniques to improve the performance of action recognition systems. The first technique addresses the ambiguity in the temporal extent of actions by learning a classifier over the distribution of classifier scores computed on video subsequences. A classifier based on this score distribution is shown to be more effective than using the maximum or average score. The second technique learns a classifier over the relative values of action scores, capturing correlation and mutual exclusion between action classes. Both techniques are simple and have efficient implementations using a Least-Squares SVM. We demonstrate that, taken together, the techniques exceed state-of-the-art performance by a wide margin on challenging benchmarks for human action recognition.
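
The first technique can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that a base classifier has already produced one score per video subsequence, that the score distribution is summarized by sorting those scores and resampling them to a fixed length, and that a linear least-squares SVM is fit in closed form on the resulting features. The function names, the `n_bins` and `lam` parameters, and the interpolation step are all hypothetical choices made for illustration.

```python
import numpy as np

def score_distribution_feature(subseq_scores, n_bins=5):
    """Summarize a video's subsequence scores as a fixed-length vector.

    Assumption: the distribution is represented by the scores sorted in
    descending order and linearly resampled to n_bins positions.
    """
    s = np.sort(np.asarray(subseq_scores, dtype=float))[::-1]
    pos = np.linspace(0, len(s) - 1, n_bins)
    return np.interp(pos, np.arange(len(s)), s)

def add_bias(X):
    """Append a constant column so the last weight acts as a bias term."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_ls_svm(X, y, lam=1.0):
    """Linear least-squares SVM in closed form.

    Minimizes lam * ||w||^2 + sum_i (w . x_i + b - y_i)^2 with labels
    y_i in {-1, +1}; the bias b is left unregularized.
    """
    Xb = add_bias(X)
    R = lam * np.eye(Xb.shape[1])
    R[-1, -1] = 0.0  # no penalty on the bias
    return np.linalg.solve(Xb.T @ Xb + R, Xb.T @ np.asarray(y, dtype=float))

def predict(w, X):
    """Real-valued decision scores; the sign gives the predicted label."""
    return add_bias(X) @ w

# Toy usage with made-up subsequence scores for four videos.
videos = [
    [0.9, 0.8, 0.7, 0.1],     # positive: many high-scoring subsequences
    [0.8, 0.6, 0.5],          # positive
    [0.9, -0.5, -0.6, -0.7],  # negative: a single spurious high score
    [-0.2, -0.4, -0.9],       # negative
]
labels = [1, 1, -1, -1]
X = np.stack([score_distribution_feature(v) for v in videos])
w = fit_ls_svm(X, labels, lam=0.1)
decision = predict(w, X)
```

In this representation the first feature component is the maximum subsequence score, so a classifier that uses only the maximum corresponds to putting all weight on that component. Learning the weights lets the classifier also take the lower-ranked scores into account.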
