Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism

Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot-type sequence, which not only expands the storytelling potential of a song but also facilitates the automatic concert-video mashup process and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Unlike the one-way transfer from a pre-trained teacher network (or ensemble network) to a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network; that is, the student network also teaches. Specifically, our method first learns a teacher network composed of several assistant networks, which generates a shot-type sequence and accordingly produces the soft-target (shot-type) distribution through KD. It then trains a student network that learns from both the ground-truth labels (hard targets) and the soft-target distribution, which eases optimization and improves generalization. As the student network gradually advances, it in turn feeds knowledge back to the assistant networks, thereby improving the teacher network in each iteration. Owing to this interactive design, the DIL mechanism bridges the gap between the teacher and student networks and yields superior capability in both. Objective and subjective experimental results demonstrate that both the teacher and student networks generate more attractive shot sequences from music, thereby enhancing the viewing and listening experience.
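
For concreteness, the sketch below shows what one DIL iteration could look like. It is a minimal, hypothetical implementation in PyTorch (an assumed framework; the paper does not publish code): distillation_loss, interactive_step, the temperature T, and the mixing weight alpha are all illustrative names and values, and the sequence model is reduced to per-step shot-type classification for brevity.

```python
# A minimal sketch of the DIL training signals, assuming PyTorch.
# All names, hyperparameters, and the optimizer layout are illustrative.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, hard_targets,
                      T=4.0, alpha=0.5):
    """Mix hard-target cross-entropy with temperature-softened soft targets.

    T     -- temperature that smooths the shot-type distribution (assumed value)
    alpha -- weight between soft- and hard-target supervision (assumed value)
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-loss scale
    hard = F.cross_entropy(student_logits, hard_targets)
    return alpha * soft + (1.0 - alpha) * hard


def interactive_step(assistants, student, music_feats, shot_labels, opts):
    """One DIL iteration on a batch of per-step shot-type logits."""
    # 1) Teacher: average the assistant networks' logits into an ensemble.
    teacher_logits = torch.stack([a(music_feats) for a in assistants]).mean(0)

    # 2) Student learns from the hard labels and the teacher's soft targets.
    s_loss = distillation_loss(student(music_feats),
                               teacher_logits.detach(), shot_labels)
    opts["student"].zero_grad()
    s_loss.backward()
    opts["student"].step()

    # 3) Feedback: the updated student's soft targets teach each assistant,
    #    improving the teacher ensemble for the next iteration.
    s_soft = student(music_feats).detach()
    for assistant, opt in zip(assistants, opts["assistants"]):
        a_loss = distillation_loss(assistant(music_feats), s_soft, shot_labels)
        opt.zero_grad()
        a_loss.backward()
        opt.step()
```

Iterating this step alternates the two directions of distillation: the assistant ensemble supervises the student, and the student in turn refines each assistant, which is the collaborative behavior the abstract describes.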
