Seethevoice: Learning from Music to Visual Storytelling of Shots

Shot types in the language of film are key elements a director uses for visual storytelling. In filming a musical performance, manipulating shots can create desired effects such as conveying emotion or deepening the atmosphere. However, while this visual storytelling technique is routinely employed in professional recordings of a live concert, audience recordings of the same event usually lack such sophisticated manipulation. It would therefore be useful to have a versatile system that can perform video mashup to create a refined video from such amateur clips. To this end, we propose to translate the music into a near-professional shot (type) sequence by learning the relation between music and the visual storytelling of shots. The resulting shot sequence can then be used to better portray the visual storytelling of a song and to guide the concert video mashup process. Our method introduces a novel probabilistic fusion approach, named multi-resolution fused recurrent neural networks (MF-RNNs) with film-language, which integrates multi-resolution fused RNNs with a film-language model to boost translation performance. Results from objective and subjective experiments demonstrate that MF-RNNs with film-language can generate an appealing shot sequence that offers a better viewing experience.
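
The following is a minimal sketch, not the authors' implementation, of the general idea described above: several RNN branches read music features at different temporal resolutions, their shot-type posteriors are fused in probability space, and a film-language prior (here approximated by a shot-transition matrix decoded with Viterbi) re-scores the fused sequence. All layer sizes, feature dimensions, the set of shot types, and the transition prior are illustrative assumptions.

```python
# Sketch of multi-resolution RNN fusion for music-to-shot-type prediction
# with a simple film-language (shot-transition) prior. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SHOT_TYPES = 4  # assumed label set, e.g. close-up, medium, long, extreme long


class ShotRNN(nn.Module):
    """One LSTM branch operating on music features at a single temporal resolution."""
    def __init__(self, feat_dim=40, hidden=128, num_classes=NUM_SHOT_TYPES):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h)            # per-frame shot-type logits


class MultiResolutionFusion(nn.Module):
    """Fuse per-resolution posteriors by averaging them in probability space."""
    def __init__(self, resolutions=(1, 2, 4), feat_dim=40):
        super().__init__()
        self.resolutions = resolutions
        self.branches = nn.ModuleList(ShotRNN(feat_dim) for _ in resolutions)

    def forward(self, x):              # x: (batch, time, feat_dim) at the finest resolution
        T = x.size(1)
        probs = []
        for r, branch in zip(self.resolutions, self.branches):
            xr = F.avg_pool1d(x.transpose(1, 2), kernel_size=r, stride=r)   # coarser time axis
            pr = F.softmax(branch(xr.transpose(1, 2)), dim=-1)
            # upsample each branch back to the finest time axis before fusing
            pr = F.interpolate(pr.transpose(1, 2), size=T, mode="nearest").transpose(1, 2)
            probs.append(pr)
        return torch.stack(probs).mean(dim=0)   # fused per-frame posteriors


def viterbi_decode(posteriors, transition):
    """Re-score fused posteriors with a shot-transition matrix (film-language stand-in)."""
    T, C = posteriors.shape
    logp = torch.log(posteriors + 1e-8)
    logA = torch.log(transition + 1e-8)
    score = logp[0].clone()
    back = torch.zeros(T, C, dtype=torch.long)
    for t in range(1, T):
        cand = score.unsqueeze(1) + logA         # (prev_state, cur_state)
        score, back[t] = cand.max(dim=0)
        score = score + logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))


if __name__ == "__main__":
    model = MultiResolutionFusion()
    music = torch.randn(1, 64, 40)               # e.g. 64 frames of 40-d log-mel features (assumed)
    fused = model(music)[0].detach()             # (time, classes)
    A = torch.full((NUM_SHOT_TYPES, NUM_SHOT_TYPES), 1.0 / NUM_SHOT_TYPES)  # uniform stand-in prior
    print(viterbi_decode(fused, A)[:10])         # predicted shot-type sequence for the first frames
```

In the paper's setting the film-language model and the fusion scheme are more elaborate; the sketch only illustrates why fusing in probability space and adding a sequence-level prior can smooth per-frame predictions into a coherent shot sequence.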
