From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification

Automatically classifying large scale of video data is an urgent yet challenging task. To bridge the semantic gap between low-level features and high-level video semantics, we propose a method to represent videos with their mid-level semantics. Inspired by the problem of text classification, we regard the visual objects in videos as the words in documents, and adapt the TF-IDF word weighting method to encode videos by visual objects. Some extensions upon the proposed method are also made according to the characteristics of videos. We integrate the proposed semantic encoding method with the popular two-stream CNN model for video classification. Experiments are conducted on two large-scale video datasets, CCV and ActivityNet. The experimanetal results validates the effectiveness of our method.

[1]  Edward A. Fox,et al.  Research Contributions , 2014 .

[2]  Yi Zhu,et al.  UC Merced Submission to the ActivityNet Challenge 2016 , 2017, ArXiv.

[3]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[4]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[5]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[6]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  G. K. Tam,et al.  Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams , 2015 .

[8]  Chengqi Zhang,et al.  Dynamic Concept Composition for Zero-Example Event Detection , 2016, AAAI.

[9]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[10]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[11]  Jun Wang,et al.  Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ming-Syan Chen,et al.  Video Event Detection by Inferring Temporal Instance Labels , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[15]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[16]  Yu-Gang Jiang,et al.  Harnessing Object and Scene Semantics for Large-Scale Video Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ramakant Nevatia,et al.  DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[19]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[22]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Xi Wang,et al.  Evaluating Two-Stream CNN for Video Classification , 2015, ICMR.

[24]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[25]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[26]  Hui Cheng,et al.  Video event recognition using concept attributes , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[27]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Wei Li,et al.  CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016 , 2016, ArXiv.

[29]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[30]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[32]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[35]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[36]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[37]  Fabio Cuzzolin,et al.  Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge , 2016, ArXiv.