To Click or Not To Click: Automatic Selection of Beautiful Thumbnails from Videos

Thumbnails play an important role in online video. As the most representative snapshot of a video, a thumbnail captures its essence and gives viewers their first impression; a great thumbnail makes a video more attractive to click and watch. We present an automatic thumbnail selection system that exploits two characteristics commonly associated with meaningful and attractive thumbnails: high relevance to the video content and superior visual aesthetic quality. The system selects attractive thumbnails by analyzing various visual quality and aesthetic metrics of video frames, and it performs a clustering analysis to determine each frame's relevance to the video content, making the resulting thumbnails more representative of the video. On the task of predicting thumbnails chosen by professional video editors, we demonstrate the effectiveness of our system against six baseline methods, using a real-world dataset of 1,118 videos collected from Yahoo Screen. In addition, we study what makes a frame a good thumbnail by analyzing the statistical relationship between thumbnail and non-thumbnail frames in terms of various image quality features. Our study suggests that the selection of a good thumbnail is highly correlated with objective visual quality metrics, such as frame texture and sharpness, implying that it is possible to build an automatic thumbnail selection system based on visual aesthetics.
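
The two criteria named in the abstract, relevance through clustering and visual quality through frame-level metrics, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: sharpness is approximated by the variance of a 4-neighbour Laplacian, each frame is described by a coarse intensity histogram, clustering is plain k-means, and the function names (`sharpness`, `kmeans`, `select_thumbnail`) are hypothetical. The idea is to pick the sharpest frame from the largest cluster, so that the chosen frame is both representative and of good quality.

```python
import numpy as np

def sharpness(frame):
    """Variance of a 4-neighbour Laplacian response (assumed quality proxy)."""
    f = frame.astype(np.float64)
    lap = (f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
           - 4.0 * f[1:-1, 1:-1])
    return float(lap.var())

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the rows of X; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(np.float64)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def select_thumbnail(frames, k=2):
    """Return the index of the sharpest frame within the largest cluster."""
    # Describe each grayscale frame by a normalized 16-bin intensity histogram.
    feats = np.array([
        np.histogram(f, bins=16, range=(0, 256))[0] / f.size for f in frames
    ])
    labels = kmeans(feats, k)
    largest = np.bincount(labels, minlength=k).argmax()
    members = np.flatnonzero(labels == largest)
    return int(max(members, key=lambda i: sharpness(frames[i])))
```

Restricting the choice to the largest cluster is one simple way to encode relevance: a very sharp frame from a scene that occupies only a small part of the video is passed over in favour of the best frame from the dominant scene.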
