Pose-Guided R-CNN for Jersey Number Recognition in Sports

Recognizing player jersey number in sports match video streams is a challenging computer vision task. The human pose and view-point variations displayed in frames lead to many difficulties in recognizing the digits on jerseys. These challenges are addressed here using an approach that exploits human body part cues with a Region-based Convolutional Neural Network (R-CNN) variant for digit level localization and classification. The paper first adopts the Region Proposal Network (RPN) to perform anchor classification and bounding-box regression over three classes: background, person and digit. The person and digit proposals are geometrically related and fed to a network classifier. Subsequently, it introduces a human body key-point prediction branch and a pose-guided regressor to get better bounding-box offsets for generating digit proposals. A novel dataset of soccer-match video frames with corresponding multi-digit class labels, player and jersey number bounding boxes, and single digit segmentation masks is collected. Our framework outperforms all existing models on jersey number recognition task. This work will be essential to the automation of player identification across multiple sports, and releasing the dataset will ease future research on sports video analysis.

[1]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[2]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Wen Gao,et al.  Jersey number detection in sports video for athlete identification , 2005, Visual Communications and Image Processing.

[4]  Xiangyang Xue,et al.  Arbitrary-Oriented Scene Text Detection via Rotation Proposals , 2017, IEEE Transactions on Multimedia.

[5]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[6]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Tae-Hyun Oh,et al.  Part-Based Player Identification Using Deep Convolutional Representation and Multi-scale Pooling , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[8]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[9]  Christoph Meinel,et al.  SEE: Towards Semi-Supervised End-to-End Scene Text Recognition , 2017, AAAI.

[10]  Kathryn Fraughnaugh,et al.  Introduction to graph theory , 1973, Mathematical Gazette.

[11]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[12]  Abhishek Dutta,et al.  The VGG Image Annotator (VIA) , 2019, ArXiv.

[13]  Xiang Bai,et al.  Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  A. Ng Feature selection, L1 vs. L2 regularization, and rotational invariance , 2004, Twenty-first international conference on Machine learning - ICML '04.

[15]  Jiri Matas,et al.  Deep TextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[17]  Karsten Müller,et al.  Soccer player recognition using spatial constellation features and jersey number recognition , 2017, Comput. Vis. Image Underst..

[18]  James J. Little,et al.  Learning to Track and Identify Players from Broadcast Sports Videos , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  James J. Little,et al.  Identifying players in broadcast sports videos using conditional random fields , 2011, CVPR 2011.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Georges Quénot,et al.  From Text Detection in Videos to Person Identification , 2012, 2012 IEEE International Conference on Multimedia and Expo.

[23]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Hrvoje Dujmić,et al.  Player Number Localization and Recognition in Soccer Video using HSV Color Space and Internal Contours , 2008 .

[26]  Chih-Yang Lin,et al.  Identification and tracking of players in sport videos , 2013, ICIMCS '13.

[27]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Yaroslav Bulatov,et al.  Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks , 2013, ICLR.

[29]  Karsten Müller,et al.  Soccer Jersey Number Recognition Using Convolutional Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[30]  Changhu Wang,et al.  Jersey Number Recognition with Semi-Supervised Spatial Transformer Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[31]  Christophe De Vleeschouwer,et al.  Detection and recognition of sports(wo)men from multiple views , 2009, 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC).