Deep Fusion of Multiple Semantic Cues for Complex Event Recognition

We present a deep learning strategy to fuse multiple semantic cues for complex event recognition. In particular, we tackle the recognition task by answering how to jointly analyze human actions (who is doing what), objects (what), and scenes (where). First, each type of semantic features (e.g., human action trajectories) is fed into a corresponding multi-layer feature abstraction pathway, followed by a fusion layer connecting all the different pathways. Second, the correlations of how the semantic cues interacting with each other are learned in an unsupervised cross-modality autoencoder fashion. Finally, by fine-tuning a large-margin objective deployed on this deep architecture, we are able to answer the question on how the semantic cues of who, what, and where compose a complex event. As compared with the traditional feature fusion methods (e.g., various early or late strategies), our method jointly learns the essential higher level features that are most effective for fusion and recognition. We perform extensive experiments on two real-world complex event video benchmarks, MED'11 and CCV, and demonstrate that our method outperforms the best published results by 21% and 11%, respectively, on an event recognition task.

[1]  Ramakant Nevatia,et al.  ISOMER: Informative Segment Observations for Multimedia Event Recounting , 2014, ICMR.

[2]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[4]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[5]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[6]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Mubarak Shah,et al.  Complex Events Detection Using Data-Driven Concepts , 2012, ECCV.

[8]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Santosh S. Vempala,et al.  Latent semantic indexing: a probabilistic analysis , 1998, PODS '98.

[10]  Yu-Gang Jiang,et al.  SUPER: towards real-time event recognition in internet videos , 2012, ICMR.

[11]  Zi Huang,et al.  Multi-Feature Fusion via Hierarchical Regression for Multimedia Analysis , 2013, IEEE Transactions on Multimedia.

[12]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[13]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[14]  Jun Wang,et al.  Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.

[15]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Teruko Mitamura,et al.  Multimodal knowledge-based analysis in multimedia event detection , 2012, ICMR '12.

[17]  Nazli Ikizler-Cinbis,et al.  Object, Scene and Actions: Combining Multiple Features for Human Action Recognition , 2010, ECCV.

[18]  Tijmen Tieleman,et al.  Training restricted Boltzmann machines using approximations to the likelihood gradient , 2008, ICML '08.

[19]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[20]  Alexander G. Hauptmann,et al.  Leveraging high-level and low-level features for multimedia event detection , 2012, ACM Multimedia.

[21]  Dong Liu,et al.  Sample-Specific Late Fusion for Visual Category Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Dong Liu,et al.  Robust late fusion with rank minimization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Yichuan Tang,et al.  Deep Learning using Support Vector Machines , 2013, ArXiv.

[24]  John Shawe-Taylor,et al.  Two view learning: SVM-2K, Theory and Practice , 2005, NIPS.

[25]  Nicu Sebe,et al.  Feature Weighting via Optimal Thresholding for Video Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Fei-Fei Li,et al.  Video Event Understanding Using Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Cordelia Schmid,et al.  Actom sequence models for efficient action detection , 2011, CVPR 2011.

[30]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[31]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Ramakant Nevatia,et al.  Evaluating multimedia features and fusion for example-based event detection , 2013, Machine Vision and Applications.

[33]  Chih-Jen Lin,et al.  A comparison of methods for multiclass support vector machines , 2002, IEEE Trans. Neural Networks.

[34]  Masoud Mazloom,et al.  Searching informative concept banks for video event detection , 2013, ICMR.

[35]  Ramakant Nevatia,et al.  Large-scale web video event classification by use of Fisher Vectors , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[36]  Hui Cheng,et al.  Video event recognition using concept attributes , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[37]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[38]  Ramakant Nevatia,et al.  ACTIVE: Activity Concept Transitions in Video Event Classification , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Dong Liu,et al.  Discovering joint audio–visual codewords for video event detection , 2013, Machine Vision and Applications.

[40]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[41]  Mubarak Shah,et al.  Recognizing Complex Events Using Large Margin Joint Low-Level Event Model , 2012, ECCV.

[42]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[43]  Andrew W. Fitzgibbon,et al.  Efficient Object Category Recognition Using Classemes , 2010, ECCV.

[44]  Dong Liu,et al.  Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images , 2014, ICMR.

[45]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[46]  Shuang Wu,et al.  Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..