ViStoryNet: Neural Networks with Successive Event Order Embedding and BiLSTMs for Video Story Regeneration

A video is a vivid medium akin to the human visual-linguistic experience: it conveys a sequence of situations, actions, and dialogues that can be told as a story. In this study, we propose frameworks for learning and regenerating stories from videos, using successive event order supervision to enforce contextual coherence. This supervision induces each episode to form a trajectory in the latent space, yielding a composite representation of both ordering and semantics. We use children's cartoon videos as training data; their advantages include an omnibus style, short and explicit storylines, chronological narrative order, and a relatively limited number of characters and spatial settings. We build an encoder-decoder architecture with successive event order embedding (SEOE) and train bidirectional LSTMs as sequence models with multi-step sequence prediction. Using approximately 200 episodes of the cartoon 'Pororo the Little Penguin', we report empirical results on story regeneration tasks and SEOE. In addition, each episode traces a trajectory-like shape in the model's latent space, providing geometric information for the sequence models.
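The paper does not include code, so the sketch below is only a rough illustration of the kind of architecture the abstract describes: a BiLSTM over per-event embeddings with multiple look-ahead prediction heads (multi-step sequence prediction), plus a hinge-style loss that pushes successive events to advance along the episode's latent trajectory, which is one possible reading of "successive event order embedding." All names (StorySequenceModel, order_embedding_loss), dimensions, and loss details are assumptions, not the authors' published design.

```python
# Hypothetical PyTorch sketch; layer sizes, the per-step prediction heads,
# and the progression loss are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class StorySequenceModel(nn.Module):
    """BiLSTM over per-event embeddings with multi-step prediction heads."""
    def __init__(self, event_dim=512, hidden_dim=256, pred_steps=2):
        super().__init__()
        self.bilstm = nn.LSTM(event_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # One linear head per look-ahead step (multi-step sequence prediction).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, event_dim) for _ in range(pred_steps)])

    def forward(self, events):           # events: (batch, T, event_dim)
        states, _ = self.bilstm(events)  # states: (batch, T, 2*hidden_dim)
        # Head k predicts the embedding of the event k+1 steps ahead.
        return [head(states) for head in self.heads]

def order_embedding_loss(events, margin=0.1):
    """Hinge loss nudging consecutive events to progress monotonically along
    the episode's start-to-end direction (one reading of SEOE)."""
    prev, nxt = events[:, :-1], events[:, 1:]
    # Direction of overall episode progression in the latent space.
    direction = (events[:, -1] - events[:, 0]).unsqueeze(1)
    direction = direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)
    prog_prev = (prev * direction).sum(-1)
    prog_next = (nxt * direction).sum(-1)
    # Penalize pairs whose next event fails to advance by at least `margin`.
    return torch.relu(margin - (prog_next - prog_prev)).mean()

# Usage with toy data: 8 episodes, 12 events each, 512-dim event embeddings.
model = StorySequenceModel()
events = torch.randn(8, 12, 512)
preds = model(events)                 # list of (8, 12, 512) tensors
loss = order_embedding_loss(events)   # scalar ordering penalty
```

At training time the k-th head's output at step t would be compared against the ground-truth event embedding at step t+k+1, and the ordering penalty would be added to that prediction loss; this is how such a trajectory-shaped latent space is typically encouraged, though the exact combination used in the paper is not specified here.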
