INSIGHT@DCU TRECVID 2019: Video to Text

In this paper we describe the approach we developed for the TRECVID Video to Text task, specifically the free-text description generation sub-task. This sub-task consists of generating a textual description of a video using only the information that can be extracted from the video itself. We tackle the problem with a BLSTM network, a commonly used architecture for this task, extended with an alternate enhance mechanism. To improve the model, we study the effect of training on different datasets and input features. One of the main difficulties of video captioning is the size of the vocabulary: the model needs to produce rich descriptions without prior knowledge of the scene, which adds another level of complexity. Therefore, we also discuss the use of an image captioning module to guide the initial text obtained from the video.
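To make the encoder-decoder setup concrete, the sketch below shows a minimal BLSTM-based video captioner of the kind described above: a bidirectional LSTM encodes a sequence of precomputed per-frame features, and a unidirectional LSTM decodes a caption word by word. All names, dimensions, and the teacher-forcing training step are illustrative assumptions, not the exact configuration used in our experiments.

# Minimal BLSTM encoder-decoder sketch for video captioning (PyTorch).
# Feature dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512,
                 vocab_size=10000, embed_dim=300):
        super().__init__()
        # Bidirectional LSTM encodes the sequence of per-frame CNN features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Project concatenated forward/backward states to the decoder size.
        self.bridge = nn.Linear(2 * hidden_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Unidirectional LSTM decodes the caption one word at a time.
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len)
        enc_out, _ = self.encoder(frame_feats)
        # Mean-pool encoder states to initialise the decoder hidden state.
        h0 = torch.tanh(self.bridge(enc_out.mean(dim=1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(captions), (h0, c0))
        return self.out(dec_out)  # (batch, seq_len, vocab_size) logits


# Usage with random tensors standing in for real frame features and captions.
model = VideoCaptioner()
feats = torch.randn(4, 30, 2048)           # 4 videos, 30 frames each
caps = torch.randint(0, 10000, (4, 12))    # 4 tokenised captions, 12 tokens
# Teacher forcing: feed the caption shifted by one position.
logits = model(feats, caps[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), caps[:, 1:].reshape(-1))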