End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

We propose a high-level concept word detector that can be integrated with any video-to-language models. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, the proposed word detector is trainable in an end-to-end manner jointly with any video-to-language models. To effectively exploit the detected words, we also develop a semantic attention mechanism that selectively focuses on the detected concept words and fuse them with the word encoding and decoding in the language model. In order to demonstrate that the proposed approach indeed improves the performance of multiple video-to-language tasks, we participate in all the four tasks of LSMDC 2016 [18]. Our approach has won three of them, including fill-in-the-blank, multiple-choice test, and movie retrieval.

[1]  Ewan Klein,et al.  Natural Language Processing with Python , 2009 .

[2]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[3]  Trevor Darrell,et al.  Captioning Images with Diverse Objects , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[5]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[8]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[9]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[10]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[11]  Tal Hassner,et al.  Temporal Tessellation for Video Annotation and Summarization , 2016, ArXiv.

[12]  Mubarak Shah,et al.  Video Fill in the Blank with Merging LSTMs , 2016, ArXiv.

[13]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Qi Wu,et al.  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[16]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Tal Hassner,et al.  Temporal Tessellation: A Unified Approach for Video Analysis , 2016, ICCV.

[18]  U. Austin,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2017 .

[19]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[21]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[25]  Leonid Sigal,et al.  Learning Language-Visual Embedding for Movie Understanding with Natural-Language , 2016, ArXiv.

[26]  Bernt Schiele,et al.  Coherent Multi-sentence Video Description with Variable Level of Detail , 2014, GCPR.

[27]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[28]  Wei Xu,et al.  Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[31]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[32]  Anton van den Hengel,et al.  Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge , 2016, ArXiv.

[33]  Christopher Joseph Pal,et al.  Movie Description , 2016, International Journal of Computer Vision.

[34]  Bernt Schiele,et al.  The Long-Short Story of Movie Description , 2015, GCPR.

[35]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[37]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[39]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Bernt Schiele,et al.  Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[41]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.