A Joint Framework for Athlete Tracking and Action Recognition in Sports Videos

Sports video analysis has received increasing attention in recent years. Athlete tracking and action recognition are two of its major problems and are closely related; however, existing studies consider and process them separately. In this paper, we propose a joint framework for athlete tracking and action recognition in sports videos. For athlete tracking, we propose a scaling- and occlusion-robust tracker, named scaling and occlusion robust compressive tracking, to localize a specific athlete in each frame. It follows the approach of compressive tracking (CT) but extends it in two respects: scale refinement and occlusion recovery. For the former, an objectness method, Edge Boxes, is adopted to generate proposals, which replace the fixed sampling boxes in CT and better fit the scales of the candidate objects. For the latter, a candidate obstruction-based solution is presented, which introduces additional trackers to detect possible obstructions and to relocate the target when the occlusion ends. For action recognition, we propose a long-term recurrent region-guided convolutional network, which recognizes pre-defined actions by modeling discriminative temporal cues in the tracking results. We employ SPP-net to extract a robust feature from the tracked region in each frame. The features of all frames are then fed into a stack of recurrent sequence models to capture long-term region-level information. We extensively evaluate the proposed approach on a newly collected sports video benchmark and on the publicly available UIUC2 dataset, and the experimental results demonstrate its effectiveness.
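To illustrate why spatial pyramid pooling suits tracked regions whose scale changes from frame to frame, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes a single-channel feature map, max pooling, and pyramid levels of 1, 2, and 4 bins per side; the function name `spp_max_pool` and all parameters are hypothetical. The point it shows is that regions of different sizes yield feature vectors of the same length, which is what a downstream recurrent model needs.

```python
from math import ceil, floor


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into a fixed-length vector.

    feature_map: list of rows (H x W), with H >= 1 and W >= 1.
                 A single channel is assumed for simplicity.
    levels:      pyramid grid sizes (assumed values); the output
                 length is sum(n * n for n in levels).
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # floor for the start and ceil for the end keeps every
            # bin non-empty, even when the map is smaller than the grid.
            r0, r1 = floor(i * h / n), ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = floor(j * w / n), ceil((j + 1) * w / n)
                pooled.append(
                    max(
                        feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1)
                    )
                )
    return pooled


# Two hypothetical tracked regions with different sizes.
small_region = [[float(r * 3 + c) for c in range(3)] for r in range(3)]
large_region = [[float(r * 7 + c) for c in range(7)] for r in range(5)]

small_vec = spp_max_pool(small_region)
large_vec = spp_max_pool(large_region)
print(len(small_vec), len(large_vec))
```

Both vectors have 1 + 4 + 16 = 21 entries regardless of the region size, so per-frame features can be stacked into a sequence without resizing the tracked box first.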
