Automatic Curation of Sports Highlights Using Multimodal Excitement Features

Producing sports highlight packages that summarize a game's most exciting moments is an essential task for broadcast media, yet it requires labor-intensive video editing. We propose a novel approach for auto-curating sports highlights and use it to create a first-of-a-kind, real-world system that aids editors in producing golf and tennis highlight reels. Our method fuses information from the players' reactions (action recognition, such as high-fives and fist pumps), players' expressions (aggressive, tense, smiling, and neutral), spectators (crowd cheering), the commentator (tone of voice and word analysis), and game analytics to determine the most interesting moments of a game. We accurately identify the start and end frames of key shot highlights and attach additional metadata, such as the player's name and the hole number, or analysts' input, allowing personalized content summarization and retrieval. In addition, we introduce new techniques for learning our classifiers with reduced manual annotation of training data by exploiting the correlation between different modalities. Our work has been demonstrated at a major golf tournament (2017 Masters) and two major international tennis tournaments (2017 Wimbledon and 2017 U.S. Open), successfully extracting highlights throughout the course of the sporting events. For the 2017 Masters, 54% of the clips selected by our system overlapped with the official highlight reels, and user studies showed that 90% of the non-overlapping clips were of the same quality as the official ones. For 2017 Wimbledon and the 2017 U.S. Open, the automatic selection of highlight clips agreed with human preferences 80% and 84.2% of the time, respectively.
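
As an illustration only, the fusion of per-modality excitement scores described above could be sketched as a weighted average followed by ranking. The abstract does not specify the fusion rule, so the linear weighting, the weight values, and every name below (`Segment`, `WEIGHTS`, `fuse`, `top_highlights`) are assumptions made for this sketch, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical modality weights; the abstract does not state any values.
WEIGHTS = {
    "player_action": 0.25,      # e.g. high-fives, fist pumps
    "player_expression": 0.15,  # e.g. aggressive, tense, smiling, neutral
    "crowd": 0.25,              # crowd cheering
    "commentator": 0.25,        # tone of voice and word analysis
    "analytics": 0.10,          # game analytics
}


@dataclass
class Segment:
    """A candidate clip with per-modality excitement scores in [0, 1]."""
    start_frame: int
    end_frame: int
    scores: dict = field(default_factory=dict)


def fuse(scores, weights=WEIGHTS):
    """Weighted mean over the modalities that are present and known."""
    total = sum(weights[m] for m in scores if m in weights)
    if total == 0:
        return 0.0
    weighted = sum(weights[m] * s for m, s in scores.items() if m in weights)
    return weighted / total


def top_highlights(segments, k):
    """Return the k segments with the highest fused excitement score."""
    return sorted(segments, key=lambda seg: fuse(seg.scores), reverse=True)[:k]
```

Normalizing by the weights of the modalities actually present lets a segment be scored even when one signal is missing, for example when no player face is visible; this is a design choice of the sketch rather than something the abstract states.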
