Non-Linear Consumption of Videos Using a Sequence of Personalized Multimodal Fragments

As videos take on an increasingly central role in conveying information on the Web, current linear-consumption methods, which require time proportional to a video's duration, need to be revisited. In this work, we present NoVoExp, a method that enables a Non-linear Video Consumption Experience by generating a sequence of multimodal fragments that succinctly represents the content of different segments of a video. These fragments help viewers understand the content of a video without watching it in its entirety, and serve as pointers to different segments of the video, enabling a new mechanism to consume videos. We design several baselines by building on top of video captioning and video summarization works to understand the relative advantages and disadvantages of NoVoExp, and compare performance across video durations (short, medium, long) and categories (entertainment, lectures, tutorials). We observe that the sequences of multimodal fragments generated by NoVoExp have higher relevance to the video and are more diverse yet coherent. Our extensive evaluation using automated metrics and human studies shows that our fragments are not only effective at representing the contents of the video, but also align well with targeted viewer preferences.
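To make the notion of a fragment sequence concrete, the sketch below models each fragment as a (timestamp range, keyframe, caption) record that points back into the video, and scores a candidate sequence for diversity and coherence using cosine similarity over caption embeddings. This is a minimal illustration under stated assumptions: the `Fragment` layout and the specific metric definitions (mean pairwise cosine distance for diversity, mean adjacent-pair cosine similarity for coherence) are our own illustrative choices, not NoVoExp's actual data structures or evaluation metrics.

```python
# Minimal sketch of a multimodal fragment sequence and two sequence-level
# scores. The data layout and metric definitions are illustrative
# assumptions, not the paper's implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class Fragment:
    start_s: float        # segment start (seconds); acts as a pointer into the video
    end_s: float          # segment end (seconds)
    keyframe_path: str    # representative image for the segment
    caption: str          # short text describing the segment

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def diversity(embs: np.ndarray) -> float:
    """Mean pairwise cosine *distance* between fragment embeddings:
    higher means the fragments cover more distinct content."""
    n = len(embs)
    dists = [1.0 - cosine(embs[i], embs[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0

def coherence(embs: np.ndarray) -> float:
    """Mean cosine similarity between *adjacent* fragments: higher means
    the sequence reads as a connected narrative rather than disjoint clips."""
    sims = [cosine(embs[i], embs[i + 1]) for i in range(len(embs) - 1)]
    return float(np.mean(sims)) if sims else 1.0

if __name__ == "__main__":
    # Stand-in embeddings; in practice these would come from a sentence
    # encoder applied to each fragment's caption.
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(5, 384))
    print(f"diversity={diversity(embs):.3f}, coherence={coherence(embs):.3f}")
```

A real pipeline would first segment the video (e.g., via shot-boundary detection), then select a keyframe and generate a caption per segment; the two scores above capture the tension the abstract describes, since selecting fragments for maximal diversity tends to reduce adjacent coherence and vice versa.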
