Localized Multiple Kernel Learning for Realistic Human Action Recognition in Videos

Realistic human action recognition in videos has been a useful yet challenging task. Video shots of same actions may present huge intra-class variations in terms of visual appearance, kinetic patterns, video shooting, and editing styles. Heterogeneous feature representations of videos pose another challenge on how to effectively handle the redundancy, complementariness and disagreement in these features. This paper proposes a localized multiple kernel learning (L-MKL) algorithm to tackle the issues above. L-MKL integrates the localized classifier ensemble learning and multiple kernel learning in a unified framework to leverage the strengths of both. The basis of L-MKL is to build multiple kernel classifiers on diverse features at subspace localities of heterogeneous representations. L-MKL integrates the discriminability of complementary features locally and enables localized MKL classifiers to deliver better performance in its own region of expertise. Specifically, L-MKL develops a locality gating model to partition the input space of heterogeneous representations to a set of localities of simpler data structure. Each locality then learns its localized optimal combination of Mercer kernels of heterogeneous features. Finally, the gating model coordinates the localized multiple kernel classifiers globally to perform action recognition. Experiments on two datasets show that the proposed approach delivers promising performance.

[1]  Steffen Bickel,et al.  Multi-view clustering , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[2]  Sheng Tang,et al.  TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS , 2007, TRECVID.

[3]  Gunnar Rätsch,et al.  Large Scale Multiple Kernel Learning , 2006, J. Mach. Learn. Res..

[4]  Sanjoy Dasgupta,et al.  PAC Generalization Bounds for Co-training , 2001, NIPS.

[5]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[6]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Dong Han,et al.  Selection and context for action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[8]  J. Ross Quinlan,et al.  Bagging, Boosting, and C4.5 , 1996, AAAI/IAAI, Vol. 1.

[9]  Peyman Milanfar,et al.  Action Recognition from One Example , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[11]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[12]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[13]  Yihong Gong,et al.  Action detection in complex scenes with spatial and temporal ambiguities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[14]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[15]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Anders Krogh,et al.  Neural Network Ensembles, Cross Validation, and Active Learning , 1994, NIPS.

[17]  Nello Cristianini,et al.  A statistical framework for genomic data fusion , 2004, Bioinform..

[18]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[20]  Ludmila I. Kuncheva,et al.  Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy , 2003, Machine Learning.

[21]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23]  Christopher M. Bishop,et al.  Neural networks for pattern recognition , 1995 .

[24]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[25]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[26]  Mubarak Shah,et al.  Recognizing human actions using multiple features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Motoaki Kawanabe,et al.  Multiple Kernel Learning for Object Classification , 2009 .

[28]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[29]  Ethem Alpaydin,et al.  Localized multiple kernel learning , 2008, ICML '08.

[30]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[31]  Michael I. Jordan,et al.  Multiple kernel learning, conic duality, and the SMO algorithm , 2004, ICML.

[32]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[33]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[34]  Larry S. Davis,et al.  Incremental Multiple Kernel Learning for object recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[36]  Mubarak Shah,et al.  Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Wen Gao,et al.  Group-sensitive multiple kernel learning for object categorization , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[38]  ChellappaRama,et al.  Matching Shape Sequences in Video with Applications in Human Movement Analysis , 2005 .

[39]  Nello Cristianini,et al.  Learning the Kernel Matrix with Semidefinite Programming , 2002, J. Mach. Learn. Res..

[40]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[42]  Yongdong Zhang,et al.  Automatic Detection and Analysis of Player Action in Moving Background Sports Video Sequences , 2010, IEEE Transactions on Circuits and Systems for Video Technology.

[43]  Geoffrey E. Hinton,et al.  Adaptive Mixtures of Local Experts , 1991, Neural Computation.

[44]  David W. Opitz,et al.  Actively Searching for an E(cid:11)ective Neural-Network Ensemble , 1996 .

[45]  David W. Opitz,et al.  Generating Accurate and Diverse Members of a Neural-Network Ensemble , 1995, NIPS.

[46]  Manik Varma,et al.  Learning The Discriminative Power-Invariance Trade-Off , 2007, 2007 IEEE 11th International Conference on Computer Vision.