Recognizing Complex Events in Videos by Learning Key Static-Dynamic Evidences

Complex events consist of various human interactions with different objects in diverse environments. The evidences needed to recognize events may occur in short time periods with variable lengths and can happen anywhere in a video. This fact prevents conventional machine learning algorithms from effectively recognizing the events. In this paper, we propose a novel method that can automatically identify the key evidences in videos for detecting complex events. Both static instances (objects) and dynamic instances (actions) are considered by sampling frames and temporal segments respectively. To compare the characteristic power of heterogeneous instances, we embed static and dynamic instances into a multiple instance learning framework via instance similarity measures, and cast the problem as an Evidence Selective Ranking (ESR) process. We impose l1 norm to select key evidences while using the Infinite Push Loss Function to enforce positive videos to have higher detection scores than negative videos. The Alternating Direction Method of Multipliers (ADMM) algorithm is used to solve the optimization problem. Experiments on large-scale video datasets show that our method can improve the detection accuracy while providing the unique capability in discovering key evidences of each complex event.

[1]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[2]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[3]  Thomas Deselaers,et al.  ClassCut for Unsupervised Class Segmentation , 2010, ECCV.

[4]  Trevor Darrell,et al.  An efficient projection for l 1 , infinity regularization. , 2009, ICML 2009.

[5]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[7]  Shih-Fu Chang,et al.  Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos , 2014, ICMR.

[8]  Yixin Chen,et al.  MILES: Multiple-Instance Learning via Embedded Instance Selection , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Cynthia Rudin,et al.  The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List , 2009, J. Mach. Learn. Res..

[10]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[11]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[12]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[13]  Gang Hua,et al.  Scene Aligned Pooling for Complex Video Recognition , 2012, ECCV.

[14]  Nazli Ikizler-Cinbis,et al.  Object, Scene and Actions: Combining Multiple Features for Human Action Recognition , 2010, ECCV.

[15]  Matthieu Guillaumin,et al.  Segmentation Propagation in ImageNet , 2012, ECCV.

[16]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[17]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[18]  Trevor Darrell,et al.  An efficient projection for l1, ∞ regularization , 2009, ICML '09.

[19]  Sangmin Oh,et al.  Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach , 2013, 2013 IEEE International Conference on Computer Vision.

[20]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[21]  Shivani Agarwal,et al.  The Infinite Push: A New Support Vector Ranking Algorithm that Directly Optimizes Accuracy at the Absolute Top of the List , 2011, SDM.

[22]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[24]  Nuno Vasconcelos,et al.  Dynamic Pooling for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[25]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[27]  Alain Rakotomamonjy,et al.  Sparse Support Vector Infinite Push , 2012, ICML.