From image captioning to video summary using deep recurrent networks and unsupervised segmentation

Automatic captioning systems based on recurrent neural networks have been tremendously successful at generating realistic natural-language captions for complex and varied image data. We explore methods for adapting existing models trained on large image-caption data sets to a related problem: summarising videos through natural-language descriptions and frame selection. These architectures build internal high-level representations of the input image that can be used to define probability distributions, and hence distance metrics between images. Specifically, we interpret each hidden unit inside a layer of the caption model as the un-normalised log probability of some unknown image feature relevant to the caption-generation process. We can then apply well-understood statistical divergence measures to quantify the difference between images and produce an unsupervised segmentation of video frames, classifying consecutive frames of low divergence as belonging to the same context and those of high divergence as belonging to different contexts. The final summary consists of a group of selected frames together with an accompanying text description, allowing a user to quickly explore large unlabelled video databases.
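The segmentation step described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the per-frame hidden activations, the use of the Jensen-Shannon divergence as the symmetric divergence measure, and the boundary threshold are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    """Turn un-normalised log probabilities (hidden activations)
    into a probability distribution."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric, bounded divergence between two distributions;
    eps smooths zero entries before taking logs."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def segment_frames(hidden_states, threshold):
    """Group consecutive frames into contexts: a new context starts
    whenever the divergence between adjacent frames' hidden-state
    distributions exceeds the (hypothetical) threshold."""
    dists = [softmax(h) for h in hidden_states]
    boundaries = [0]
    for t in range(1, len(dists)):
        if jensen_shannon(dists[t - 1], dists[t]) > threshold:
            boundaries.append(t)  # high divergence: context change
    return boundaries
```

In this sketch, `hidden_states` would hold one activation vector per video frame, taken from a chosen layer of the caption model; the returned boundary indices partition the video into contexts, from which representative frames can be selected and captioned for the summary.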
