A Unified Framework for Shot Type Classification Based on Subject Centric Lens

Shots are key narrative elements of many video forms, e.g., movies, TV series, and the user-generated videos thriving on the Internet. Shot types strongly influence how underlying ideas, emotions, and messages are expressed, so analyzing them is important for video understanding, a task with growing demand in real-world applications. Classifying shot type is challenging because it requires information beyond the video content itself, such as the spatial composition of a frame and the camera movement. To address these issues, we propose a learning framework, Subject Guidance Network (SGNet), for shot type recognition. SGNet separates the subject and background of a shot into two streams, which serve as separate guidance maps for scale and movement type classification, respectively. To facilitate shot type analysis and model evaluation, we build MovieShots, a large-scale dataset containing 46K shots from 7K movie trailers annotated with their scale and movement types. Experiments show that our framework recognizes both attributes of a shot accurately, outperforming all previous methods.
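To make the two-stream idea concrete, below is a minimal PyTorch sketch of the design described above: a subject (foreground) map spatially gates the frame for scale classification, while a separate background stream drives movement-type classification. This is an illustration under stated assumptions, not the authors' implementation; the class name, the soft-gating scheme, the shared ResNet-50 backbone, and the assumption that the subject map and background frame arrive as precomputed inputs are all hypothetical. Only the class counts (5 scale types, 4 movement types) follow the MovieShots annotation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamShotClassifier(nn.Module):
    """Illustrative two-stream model: a subject (foreground) map guides
    shot-scale prediction, while a background stream guides movement-type
    prediction. Class counts follow MovieShots (5 scale, 4 movement)."""

    def __init__(self, num_scale=5, num_move=4):
        super().__init__()
        # Shared image backbone (ResNet-50 with its classifier head removed).
        backbone = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 2048, 1, 1)
        # Scale head: appearance features modulated by the subject map.
        self.scale_head = nn.Linear(2048, num_scale)
        # Movement head: features from the background stream.
        self.move_head = nn.Linear(2048, num_move)

    def forward(self, frame, subject_map, background_frame):
        # Soft spatial gating by the subject probability map (values in [0, 1]).
        subject_masked = frame * subject_map
        scale_feat = self.encoder(subject_masked).flatten(1)
        move_feat = self.encoder(background_frame).flatten(1)
        return self.scale_head(scale_feat), self.move_head(move_feat)

# Example forward pass with random tensors standing in for real inputs.
model = TwoStreamShotClassifier()
frame = torch.randn(2, 3, 224, 224)
subject_map = torch.rand(2, 1, 224, 224)    # per-pixel subject probability
background = torch.randn(2, 3, 224, 224)    # frame with the subject suppressed
scale_logits, move_logits = model(frame, subject_map, background)
print(scale_logits.shape, move_logits.shape)  # torch.Size([2, 5]) torch.Size([2, 4])
```

In SGNet itself the guidance maps are produced inside the network rather than supplied externally; the sketch takes them as given inputs only to keep the example short.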
