Storytelling machines for video search

We study a fundamental question for developing storytelling machines: what vocabulary is suited for machines to tell the story of a video? We start by manually specifying the vocabulary concepts and their annotations. To handcraft the vocabulary effectively, we empirically study the question: what are the best practices for handcrafting a vocabulary for video storytelling? From our analysis, we conclude that for effective storytelling the vocabulary should encompass thousands of concepts of various types, trained and normalized appropriately.

Creating such a handcrafted vocabulary of concepts is labor intensive. We alleviate the manual labor by addressing the next research question: can a machine learn novel video concepts by composition? We propose an algorithm that learns new concepts by composing existing vocabulary concepts with Boolean logic connectives, as sketched in the example below, and we demonstrate that the learned composite concepts enrich the vocabulary.

As a further step towards reducing the manual labor of vocabulary construction, we investigate the question: can a machine learn its vocabulary from human stories? We demonstrate that human-written stories and their associated videos are a valuable resource for learning the vocabulary. Finally, we address the question: how should a machine learn its vocabulary from human stories? We formulate vocabulary construction as learning a multimodal embedding between visual features and terms from stories. The embedding is learned by minimizing a joint objective function that balances a term descriptiveness loss and a video predictability loss. As a result, terms that are correlated in the stories are combined to improve their video predictability.
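
To make the composite-concept step concrete, here is a minimal sketch of the idea (our illustration with hypothetical concept names, not the thesis implementation): given calibrated detector scores in [0, 1], new concepts can be scored by combining existing ones with fuzzy-logic versions of the Boolean connectives. The choice of product for AND and probabilistic sum for OR is one common convention among several.

```python
# Minimal sketch of Boolean concept composition over detector scores.
# Scores are assumed to be calibrated probabilities in [0, 1]; the
# operators below are one standard fuzzy-logic choice, not necessarily
# the one used in the thesis.
from typing import Callable, Dict

Scores = Dict[str, float]  # concept name -> probability for one video


def AND(a: str, b: str) -> Callable[[Scores], float]:
    return lambda s: s[a] * s[b]                # p(a AND b) ~ p(a) * p(b)


def OR(a: str, b: str) -> Callable[[Scores], float]:
    return lambda s: s[a] + s[b] - s[a] * s[b]  # probabilistic sum


def NOT(a: str) -> Callable[[Scores], float]:
    return lambda s: 1.0 - s[a]                 # complement


# Example: a composite "indoor party" concept from two primitive detectors
# (hypothetical names, for illustration only).
composite = AND("people_dancing", "indoor")
video_scores = {"people_dancing": 0.8, "indoor": 0.7}
print(round(composite(video_scores), 2))  # 0.56
```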
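
To make the joint objective concrete, one plausible instantiation (our notation, not necessarily the exact thesis formulation) treats the embedding as a latent matrix tied to both modalities by linear maps:

```latex
% Hedged sketch of the joint objective; notation is ours.
% X: video features, Y: term occurrences from the stories,
% S: latent embedding, A: embedding-to-term map (descriptiveness),
% W: feature-to-embedding map (predictability), lambda balances the losses.
\min_{S,\,A,\,W}\;
\underbrace{\lVert Y - S A \rVert_F^2}_{\text{term descriptiveness}}
\;+\;
\lambda\,\underbrace{\lVert S - X W \rVert_F^2}_{\text{video predictability}}
```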
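
Under this assumed formulation, the objective can be minimized by alternating closed-form least-squares updates. The toy NumPy sketch below illustrates the mechanics on random data; it is our own sketch, not the thesis code.

```python
# Toy alternating-least-squares sketch of the assumed joint objective:
#   ||Y - S @ A||^2         term descriptiveness
#   lam * ||S - X @ W||^2   video predictability
import numpy as np

rng = np.random.default_rng(0)
n, d, t, k = 200, 64, 50, 10  # videos, feature dim, terms, embedding dim
lam = 1.0                      # balance between the two losses

X = rng.normal(size=(n, d))                         # video features (toy)
Y = (rng.random(size=(n, t)) < 0.1).astype(float)   # term occurrences (toy)

S = rng.normal(size=(n, k))    # latent embedding, random init
I_k = np.eye(k)

for _ in range(50):
    # Fix S: both maps are ridge-regularized least-squares solutions.
    A = np.linalg.solve(S.T @ S + 1e-6 * I_k, S.T @ Y)        # k x t
    W = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d), X.T @ S)  # d x k
    # Fix A, W: the optimal embedding also has a closed form.
    S = (Y @ A.T + lam * X @ W) @ np.linalg.inv(A @ A.T + lam * I_k)

loss = np.sum((Y - S @ A) ** 2) + lam * np.sum((S - X @ W) ** 2)
print(f"joint loss: {loss:.2f}")  # non-increasing across iterations
```

At search time, a new video with features x would be embedded as x @ W and its story terms predicted as (x @ W) @ A; because correlated terms share embedding dimensions, they are predicted jointly, which is one way to read the abstract's claim that correlated terms are combined to improve video predictability.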