Generating commentaries for tennis videos

We present an approach to automatically generating verbal commentaries for tennis games. We introduce a novel application that requires a combination of techniques from computer vision, natural language processing and machine learning. A video sequence is first analysed using state-of-the-art computer vision methods to track the ball, fit the detected edges to a court model, track the players, and recognise their strokes. Based on the recognised visual attributes, we formulate the tennis commentary generation problem in the framework of long short-term memory (LSTM) recurrent neural networks as well as structured SVMs. In particular, we investigate pre-embedding of descriptive terms and the loss function for the LSTM. We introduce a new dataset of 633 annotated pairs of tennis videos and corresponding commentaries. We perform both automatic and human-based evaluation, and demonstrate that the proposed pre-embedding and loss function lead to substantially improved accuracy of the generated commentary.

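To make the described pipeline more concrete, below is a minimal sketch (not the authors' implementation) of an attribute-conditioned LSTM decoder: a vector of recognised visual attributes (player, stroke type, ball placement, and so on) initialises the decoder state, and commentary tokens are generated one at a time. All dimensions, the attribute encoding, and the class and parameter names are illustrative assumptions; the paper's specific pre-embedding scheme and modified loss are not reproduced here.

```python
# Hypothetical sketch of an LSTM commentary generator conditioned on
# visual attributes. Uses PyTorch; sizes are placeholder assumptions.
import torch
import torch.nn as nn

class CommentaryLSTM(nn.Module):
    def __init__(self, vocab_size, attr_dim, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Embedding of commentary words (descriptive terms) into dense vectors.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Project the visual-attribute vector into the decoder's hidden space
        # so it can initialise the LSTM state.
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attrs, tokens):
        # attrs:  (batch, attr_dim)  attributes from the vision pipeline
        # tokens: (batch, seq_len)   reference commentary (teacher forcing)
        h0 = torch.tanh(self.attr_proj(attrs)).unsqueeze(0)   # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        emb = self.word_embed(tokens)                         # (batch, seq_len, embed)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                               # per-step vocabulary logits

# Example usage: training would typically minimise cross-entropy between the
# predicted and reference tokens; any task-specific loss would replace that.
model = CommentaryLSTM(vocab_size=5000, attr_dim=32)
logits = model(torch.randn(4, 32), torch.randint(0, 5000, (4, 12)))
```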