A Deep Learned Method for Video Indexing and Retrieval

In this paper, we propose a deep neural network-based method for content-based video retrieval. Our approach uses deep neural networks to generate semantic information and a graph-based storage structure to build the video indices. We devise two models: the Inception-Single Shot Multibox Detector (ISSD), which extracts spatial semantic information (objects), and RI3D, which extracts temporal semantic information (actions). The ISSD model achieves a mAP of 26.7% on the MS COCO dataset, an increase of 3.2% over the original SSD model, while the RI3D model achieves a top-1 accuracy of 97.7% on the UCF-101 dataset. The video index is built as a graph that combines the temporal and spatial semantic information. Our experimental results show that the deep-learned semantic information is highly effective for video indexing and retrieval.
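
The graph-based index described above can be illustrated with a minimal sketch. The abstract does not specify the actual storage structure, so everything here is an assumption made for illustration: the class name `VideoGraphIndex`, the `(kind, label)` node keys, and the time-interval edge payload are all hypothetical. The sketch links each video node to object (spatial) and action (temporal) label nodes, and answers a query by intersecting the videos attached to the requested labels.

```python
from collections import defaultdict


class VideoGraphIndex:
    """Toy bipartite graph: video nodes linked to semantic label nodes.

    Hypothetical structure for illustration; not the paper's implementation.
    """

    def __init__(self):
        # (kind, label) node -> {video_id -> list of (start_sec, end_sec)}
        self._edges = defaultdict(lambda: defaultdict(list))

    def add(self, video_id, kind, label, start, end):
        # kind is "object" (spatial) or "action" (temporal);
        # the edge stores the time interval where the label was detected.
        self._edges[(kind, label)][video_id].append((start, end))

    def search(self, *terms):
        # Return the videos connected to every requested (kind, label) node.
        sets = [set(self._edges.get(term, {})) for term in terms]
        return sorted(set.intersection(*sets)) if sets else []


index = VideoGraphIndex()
index.add("v1", "object", "dog", 0.0, 5.0)
index.add("v1", "action", "running", 1.0, 4.0)
index.add("v2", "object", "dog", 2.0, 3.0)

# Videos that contain both a dog and a running action.
matches = index.search(("object", "dog"), ("action", "running"))
```

In this sketch, the labels would come from the detectors' outputs (objects from a model such as ISSD, actions from a model such as RI3D); a production index would more likely sit in a graph database than in in-memory dictionaries.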
