Collective Sports: A multi-task dataset for collective activity recognition

Abstract

Collective activity recognition is an important subtask of human action recognition, for which existing datasets are limited. In this paper, we address this issue and introduce the “Collective Sports (C-Sports)” dataset, a novel benchmark for multi-task recognition of both collective activity and sports categories. Various state-of-the-art techniques are evaluated on this dataset, together with multi-task variants that demonstrate increased performance. The experimental results show that while the sports categories of the videos are inferred accurately, there is still room for improvement in collective activity recognition, especially regarding generalization to previously unseen sports categories. To evaluate this ability, we introduce a novel evaluation protocol called unseen sports, in which training and testing are carried out on disjoint sets of sports categories. The relatively lower recognition performance under this protocol indicates that the recognition models tend to be influenced by the surrounding context rather than focusing on the essence of the collective activities. We believe that the C-Sports dataset will stir further interest in this research direction.
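The unseen sports protocol described above amounts to a split in which no sports category contributes clips to both the training and test sets. The following is a minimal sketch of such a split, not the authors' implementation: the `Clip` record, the `unseen_sports_split` helper, and the placeholder labels (`sport_a`, `activity_1`, and so on) are all hypothetical and do not reflect the actual C-Sports annotation format or category names.

```python
from collections import namedtuple

# Hypothetical record layout; the real C-Sports annotations may differ.
Clip = namedtuple("Clip", ["video_id", "sport", "activity"])


def unseen_sports_split(clips, test_sports):
    """Split clips so that train and test cover disjoint sets of sports."""
    test_sports = set(test_sports)
    train = [c for c in clips if c.sport not in test_sports]
    test = [c for c in clips if c.sport in test_sports]
    return train, test


# Placeholder data: labels are illustrative, not taken from the dataset.
clips = [
    Clip("v1", "sport_a", "activity_1"),
    Clip("v2", "sport_a", "activity_2"),
    Clip("v3", "sport_b", "activity_1"),
    Clip("v4", "sport_c", "activity_2"),
    Clip("v5", "sport_c", "activity_1"),
]

# Hold out one sport entirely; its clips are used only for testing.
train, test = unseen_sports_split(clips, test_sports={"sport_c"})

train_sports = {c.sport for c in train}
held_out_sports = {c.sport for c in test}
print(sorted(train_sports), sorted(held_out_sports))
```

In this sketch the collective activity labels still appear on both sides of the split, so a model is tested on familiar activities in an unfamiliar sports context, which is the generalization ability the protocol is meant to probe.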
