ViStoryNet: Neural Networks with Successive Event Order Embedding and BiLSTMs for Video Story Regeneration

A video is a vivid medium akin to the human visual-linguistic experience: it conveys a sequence of situations, actions, and dialogues that can be told as a story. In this study, we propose frameworks for learning and regenerating stories from videos, using successive event order supervision to enforce contextual coherence. This supervision induces each episode to form a trajectory in the latent space, yielding a composite representation of both ordering and semantics. We use children's cartoon videos as training data; their advantages include an omnibus style, short and explicit storylines, chronological narrative order, and a relatively limited number of characters and spatial settings. We build an encoder-decoder architecture with successive event order embedding (SEOE) and train bidirectional LSTMs as sequence models with multi-step sequence prediction. Using approximately 200 episodes of the cartoon 'Pororo the Little Penguin', we report empirical results on story regeneration tasks and SEOE. In addition, each episode traces a trajectory-like shape in the model's latent space, providing geometric information for the sequence models.
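The paper does not include code, so the sketch below is only a rough illustration of the kind of architecture the abstract describes: a BiLSTM over per-event embeddings with multiple look-ahead prediction heads (multi-step sequence prediction), plus a hinge-style loss that pushes successive events to advance along the episode's latent trajectory, which is one possible reading of "successive event order embedding." All names (StorySequenceModel, order_embedding_loss), dimensions, and loss details are assumptions, not the authors' published design.

```python
# Hypothetical PyTorch sketch; layer sizes, the per-step prediction heads,
# and the progression loss are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class StorySequenceModel(nn.Module):
    """BiLSTM over per-event embeddings with multi-step prediction heads."""
    def __init__(self, event_dim=512, hidden_dim=256, pred_steps=2):
        super().__init__()
        self.bilstm = nn.LSTM(event_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # One linear head per look-ahead step (multi-step sequence prediction).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, event_dim) for _ in range(pred_steps)])

    def forward(self, events):           # events: (batch, T, event_dim)
        states, _ = self.bilstm(events)  # states: (batch, T, 2*hidden_dim)
        # Head k predicts the embedding of the event k+1 steps ahead.
        return [head(states) for head in self.heads]

def order_embedding_loss(events, margin=0.1):
    """Hinge loss nudging consecutive events to progress monotonically along
    the episode's start-to-end direction (one reading of SEOE)."""
    prev, nxt = events[:, :-1], events[:, 1:]
    # Direction of overall episode progression in the latent space.
    direction = (events[:, -1] - events[:, 0]).unsqueeze(1)
    direction = direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)
    prog_prev = (prev * direction).sum(-1)
    prog_next = (nxt * direction).sum(-1)
    # Penalize pairs whose next event fails to advance by at least `margin`.
    return torch.relu(margin - (prog_next - prog_prev)).mean()

# Usage with toy data: 8 episodes, 12 events each, 512-dim event embeddings.
model = StorySequenceModel()
events = torch.randn(8, 12, 512)
preds = model(events)                 # list of (8, 12, 512) tensors
loss = order_embedding_loss(events)   # scalar ordering penalty
```

At training time the k-th head's output at step t would be compared against the ground-truth event embedding at step t+k+1, and the ordering penalty would be added to that prediction loss; this is how such a trajectory-shaped latent space is typically encouraged, though the exact combination used in the paper is not specified here.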
