Semantic Attribute Enriched Storytelling from a Sequence of Images

Visual storytelling (VST) pertains to the task of generating story-based sentences from an ordered sequence of images. Contemporary techniques suffer from several limitations, such as inadequate encapsulation of visual variance and insufficient capture of context across the input sequence. Consequently, the stories generated by such techniques often lack coherence, context, and semantic information. In this research, we devise a ‘Semantic Attribute Enriched Storytelling’ (SAES) framework to mitigate these issues. To that end, we first extract the visual features of the input image sequence and the noun entities present in the visual input by employing an off-the-shelf object detector. The two feature sets are concatenated to encapsulate the visual variance of the input sequence. The features are then passed through a bidirectional LSTM sequence encoder to capture the past and future context of the input image sequence, followed by an attention mechanism to enhance the discriminability of the input to the language model, i.e., a mogrifier LSTM. Additionally, we incorporate semantic attributes, e.g., nouns, to complement the semantic context of the generated story. Detailed experimental and human evaluations are performed to establish the competitive performance of the proposed technique. We achieve up to a 1.4% improvement on the BLEU metric over recent state-of-the-art methods.
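
The sketch below illustrates the pipeline described above, not the authors' implementation: pre-extracted visual features and noun-entity embeddings (the object detector and embedding dimensions are assumptions) are concatenated, encoded by a bidirectional LSTM, and pooled with a simple attention layer; a minimal mogrifier LSTM cell, following the standard mogrifier formulation of alternately gating the input and hidden state before the LSTM update, stands in for the language model.

```python
# Minimal sketch of the SAES encoding pipeline (illustrative; all module names,
# dimensions, and the attention form are assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MogrifierLSTMCell(nn.Module):
    """LSTM cell with mogrifier gating: the input and hidden state modulate
    each other for a few rounds before the standard LSTM update."""
    def __init__(self, input_size, hidden_size, rounds=4):
        super().__init__()
        self.rounds = rounds
        self.q = nn.ModuleList([nn.Linear(hidden_size, input_size)
                                for _ in range((rounds + 1) // 2)])
        self.r = nn.ModuleList([nn.Linear(input_size, hidden_size)
                                for _ in range(rounds // 2)])
        self.cell = nn.LSTMCell(input_size, hidden_size)

    def forward(self, x, state):
        h, c = state
        for i in range(self.rounds):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(self.q[i // 2](h)) * x
            else:
                h = 2 * torch.sigmoid(self.r[i // 2](x)) * h
        return self.cell(x, (h, c))


class SAESEncoder(nn.Module):
    """Concatenate visual and noun-entity features, encode the image sequence
    with a Bi-LSTM, and attend over it to produce a context vector."""
    def __init__(self, visual_dim, entity_dim, hidden_size):
        super().__init__()
        self.bilstm = nn.LSTM(visual_dim + entity_dim, hidden_size,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)

    def forward(self, visual_feats, entity_feats):
        # visual_feats: (B, N, visual_dim); entity_feats: (B, N, entity_dim)
        fused = torch.cat([visual_feats, entity_feats], dim=-1)
        enc, _ = self.bilstm(fused)                 # (B, N, 2H) past/future context
        weights = F.softmax(self.attn(enc), dim=1)  # (B, N, 1) attention over images
        context = (weights * enc).sum(dim=1)        # (B, 2H) attended summary
        return enc, context
```

In this sketch the attended context vector would condition the mogrifier LSTM cell at each decoding step to generate the story word by word; how the semantic (noun) attributes are injected into decoding is left abstract here, since the abstract does not specify the exact mechanism.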