Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks

In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive efforts have been devoted in recent years, most existing works combined multiple video features using simple fusion strategies and neglected the utilization of inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by imposing regularizations in the learning process of a deep neural network (DNN). Through arming the DNN with better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics. We show that rDNN produces better performance over several state-of-the-art approaches. Competitive results are reported on the well-known Hollywood2 and Columbia Consumer Video benchmarks. In addition, to stimulate future research on large scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.

[1]  Yoshua Bengio,et al.  Multi-Task Learning for Stock Selection , 1996, NIPS.

[2]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[3]  Milind R. Naphade,et al.  A probabilistic framework for semantic video indexing, filtering, and retrieval , 2001, IEEE Trans. Multim..

[4]  John R. Smith,et al.  Multimedia semantic indexing using model vectors , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[5]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[6]  Michael I. Jordan,et al.  Multiple kernel learning, conic duality, and the SMO algorithm , 2004, ICML.

[7]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[8]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[9]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  Jiebo Luo,et al.  Kodak consumer video benchmark data set : concept definition and annotation * * , 2008 .

[12]  Massimiliano Pontil,et al.  Convex multi-task feature learning , 2008, Machine Learning.

[13]  Tao Mei,et al.  Correlative multi-label video annotation , 2007, ACM Multimedia.

[14]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[15]  T. Stanford,et al.  Multisensory integration: current issues from the perspective of the single neuron , 2008, Nature Reviews Neuroscience.

[16]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Jean-Philippe Vert,et al.  Clustered Multi-Task Learning: A Convex Formulation , 2008, NIPS.

[18]  Subhransu Maji,et al.  Classification using intersection kernel support vector machines is efficient , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  T. Stanford,et al.  Multisensory integration: current issues from the perspective of the single neuron , 2008, Nature Reviews Neuroscience.

[20]  Jiebo Luo,et al.  Heterogeneous feature machines for visual recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[21]  C. Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[23]  Shih-Fu Chang,et al.  Short-term audio-visual atoms for generic video concept classification , 2009, ACM Multimedia.

[24]  Jieping Ye,et al.  Multi-Task Feature Learning Via Efficient l2, 1-Norm Minimization , 2009, UAI.

[25]  Xiangyang Xue,et al.  A novel audio fingerprinting method robust to time scale modification and pitch shifting , 2010, ACM Multimedia.

[26]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[27]  Dit-Yan Yeung,et al.  A Convex Formulation for Learning Task Relationships in Multi-Task Learning , 2010, UAI.

[28]  Mohan S. Kankanhalli,et al.  Multimodal fusion for multimedia analysis: a survey , 2010, Multimedia Systems.

[29]  Alexander C. Loui,et al.  Audio-visual grouplet: temporal audio-visual interactions for general video concept classification , 2011, ACM Multimedia.

[30]  Jiayu Zhou,et al.  Integrating low-rank and group-sparse structures for robust multi-task learning , 2011, KDD.

[31]  Alexander Zien,et al.  lp-Norm Multiple Kernel Learning , 2011, J. Mach. Learn. Res..

[32]  M. Kloft,et al.  l p -Norm Multiple Kernel Learning , 2011 .

[33]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[34]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[35]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[36]  Jiayu Zhou,et al.  A multi-task learning formulation for predicting disease progression , 2011, KDD.

[37]  Kristen Grauman,et al.  Learning with Whom to Share in Multi-task Feature Learning , 2011, ICML.

[38]  G. DeAngelis,et al.  A Normalization Model of Multisensory Integration , 2011, Nature Neuroscience.

[39]  Hongliang Fei,et al.  Structured Feature Selection and Task Relationship Inference for Multi-task Learning , 2011, ICDM.

[40]  Yu-Gang Jiang,et al.  SUPER: towards real-time event recognition in internet videos , 2012, ICMR.

[41]  Chong-Wah Ngo,et al.  Fast Semantic Diffusion for Large-Scale Context-Based Image and Video Annotation , 2012, IEEE Transactions on Image Processing.

[42]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[43]  Daoqiang Zhang,et al.  Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease , 2012, NeuroImage.

[44]  Chin-Hui Lee,et al.  Explicit Performance Metric Optimization for Fusion-Based Video Retrieval , 2012, ECCV Workshops.

[45]  Dong Liu,et al.  Robust late fusion with rank minimization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[48]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[49]  Yung-Yu Chuang,et al.  Cross-Domain Multicue Fusion for Concept-Based Video Indexing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Andrew Zisserman,et al.  Three things everyone should know to improve object retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[52]  Xiangyang Xue,et al.  Multiple Task Learning Using Iteratively Reweighted Least Square , 2013, IJCAI.

[53]  Dong Liu,et al.  Large-Scale Video Hashing via Structure Learning , 2013, 2013 IEEE International Conference on Computer Vision.

[54]  Samy Bengio,et al.  Using Web Co-occurrence Statistics for Improving Image Categorization , 2013, ArXiv.

[55]  Dong Liu,et al.  Discovering joint audio–visual codewords for video event detection , 2013, Machine Vision and Applications.

[56]  Florian Metze,et al.  CMU-Informedia @ TRECVID 2013 Multimedia Event Detection , 2013 .

[57]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Daniel P. W. Ellis,et al.  Subband autocorrelation features for video soundtrack classification , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[59]  Dong Liu,et al.  Sample-Specific Late Fusion for Visual Category Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Cordelia Schmid,et al.  The AXES submissions at TRECVID 2013 , 2013, TRECVID.

[62]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[63]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[64]  Nicu Sebe,et al.  Feature Weighting via Optimal Thresholding for Video Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[65]  Lynne E. Parker,et al.  Simplex-Based 3D Spatio-temporal Feature Description for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[67]  Pong C. Yuen,et al.  Reduced Analytic Dependency Modeling: Robust Fusion for Visual Recognition , 2014, International Journal of Computer Vision.

[68]  Marc'Aurelio Ranzato,et al.  Multi-GPU Training of ConvNets , 2013, ICLR.

[69]  Andrew Zisserman,et al.  Improving Human Action Recognition Using Score Distribution and Ranking , 2014, ACCV.

[70]  Yi Li,et al.  Mariana: Tencent Deep Learning Platform and its Applications , 2014, Proc. VLDB Endow..

[71]  Mubarak Shah,et al.  Video Classification Using Semantic Concept Co-occurrences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[72]  Honglak Lee,et al.  Improved Multimodal Deep Learning with Variation of Information , 2014, NIPS.

[73]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[74]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[75]  Bingbing Ni,et al.  Beta Process Multiple Kernel Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[76]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[77]  Cees Snoek,et al.  COSTA: Co-Occurrence Statistics for Zero-Shot Classification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[78]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[79]  Jun Wang,et al.  Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.

[80]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  Bhiksha Raj,et al.  Beyond Gaussian Pyramid: Multi-skip Feature Stacking for action recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[83]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[86]  Chong-Wah Ngo,et al.  Multimedia Event Detection , 2015 .

[87]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[88]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[89]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).