FSD-10: A Dataset for Competitive Sports Content Analysis

Action recognition is an important and challenging problem in video analysis. Although the past decade has witnessed progress in action recognition with the development of deep learning, such progress has been slow in competitive sports content analysis. To promote research on action recognition from competitive sports video clips, we introduce a Figure Skating Dataset (FSD-10) for fine-grained sports content analysis. To this end, we collect 1484 clips from the worldwide figure skating championships in 2017-2018, which cover 10 different actions in the men's and ladies' programs. Each clip is recorded at 30 frames per second with a resolution of 1080 $\times$ 720. These clips are then annotated by experts with the action type, grade of execution, skater information, etc. To build a baseline for action recognition in figure skating, we evaluate state-of-the-art action recognition methods on FSD-10. Motivated by the idea that domain knowledge is of great importance in the sports field, we propose a keyframe-based temporal segment network (KTSN) for classification and achieve remarkable performance. Experimental results demonstrate that FSD-10 is an ideal dataset for benchmarking action recognition algorithms, as it requires models to accurately extract action motions rather than action poses. We hope FSD-10, which is designed to have a large collection of fine-grained actions, can serve as a new challenge for developing more robust and advanced action recognition models.
