Goal!! Event detection in sports video

Abstract

Understanding complex events from unstructured video, such as scoring a goal in a football game, is an extremely challenging task due to the dynamics, complexity, and variation of video sequences. In this work, we attack this problem by exploiting the capabilities of the recently developed framework of deep learning. We consider independently encoding spatial and temporal information via convolutional neural networks and fusing the resulting features via regularized Autoencoders. To demonstrate the capabilities of the proposed scheme, a new dataset is compiled, composed of goal and no-goal sequences. Experimental results demonstrate that extremely high classification accuracy can be achieved, from a dramatically limited number of examples, by leveraging pre-trained models with fine-tuned fusion of spatio-temporal features.

Introduction

Analyzing unstructured video streams is a challenging task for multiple reasons [10]. A first challenge is associated with the complexity of real-world dynamics manifested in such video streams, including changes in viewpoint, illumination, and quality. In addition, while annotated image datasets are prevalent, far fewer labeled datasets are available for video analytics. Last, the analysis of massive, high-dimensional video streams is extremely demanding, requiring significantly higher computational resources compared to still imagery [11].

In this work, we focus on the analysis of a particular type of video showing multi-person sport activities, and more specifically football (soccer) games. Sport videos are in general acquired from different vantage points, and the decision to select a single stream for broadcasting is taken by the director. As a result, the broadcast video stream is characterized by varying acquisition conditions, such as zooming in near the goalpost during a goal and zooming out to cover the full field. In this complex setting, we consider the high-level objective of detecting specific and semantically meaningful events, such as an opponent team scoring a goal. Succeeding in this task will allow the automatic transcription of games, video summarization, and automatic statistical analysis.

Despite the many challenges associated with video analytics, the human brain is able to extract meaning and provide contextual information in a limited amount of time and from a limited set of training examples. From a computational perspective, the process of event detection in a video sequence amounts to two fundamental steps, namely (i) spatio-temporal feature extraction and (ii) example classification. Typically, feature extraction approaches rely on highly engineered, handcrafted features such as SIFT, which, however, do not generalize to more challenging cases. To achieve this objective, we consider the state-of-the-art framework of deep learning [18] and more specifically Convolutional Neural Networks (CNNs) [16], which have taken almost all problems related to computer vision by storm, ranging from image classification [15, 16] to object detection [17] and multi-modal learning [6]. At the same time, the concept of Autoencoders, a type of neural network that tries to approximate its input at the output under regularization with various constraints, is also attracting attention due to its learning capacity in unsupervised settings [21].
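To make the Autoencoder concept concrete, the following is a minimal sketch of a sparsity-regularized Autoencoder in PyTorch. It is an illustrative sketch, not the exact network used in this work: the layer sizes, the L1 sparsity weight, and the single training step are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer Autoencoder: reconstructs its input through a compressed code."""
    def __init__(self, input_dim=4096, code_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        code = self.encoder(x)      # compressed (hidden) representation
        recon = self.decoder(code)  # reconstruction of the input
        return recon, code

def loss_fn(x, recon, code, sparsity_weight=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    return nn.functional.mse_loss(recon, x) + sparsity_weight * code.abs().mean()

# Illustrative training step on random vectors standing in for CNN descriptors.
model = SparseAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 4096)
recon, code = model(x)
loss = loss_fn(x, recon, code)
loss.backward()
optimizer.step()
```

The sparsity penalty is one of several possible regularizers; tying it to the code activations is what forces the network to learn a compact representation rather than copying the input.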
While significant effort has been devoted to designing and evaluating deep learning architectures for image analysis, leading to highly optimized architectures, the problem of video analysis is at the forefront of research, where multiple avenues are being explored. The urgent need for video analytics is driven both by the wealth of unstructured videos available online and by the complexities associated with adding the temporal dimension. In this work, we consider the problem of goal detection in broadcast, low-quality football videos. The problem is formulated as a binary classification of short video sequences, which are encoded through a spatio-temporal deep feature learning network. The key novelties of this work are to:

• Develop a novel dataset for event detection in sports video and, more specifically, for goal detection in football games;
• Investigate deep learning architectures, such as CNNs and Autoencoders, for achieving efficient event detection;
• Demonstrate that learning, and thus accurate event detection, can be achieved by leveraging information from a few labeled examples, exploiting pre-trained models.

Figure 1: Block diagram of the proposed Goal detection framework. A 20-frame moving window initially selects the part of the sequence of interest, and the selected frames undergo motion estimation. Raw pixel values and optical flows are first independently encoded using the pre-trained deep CNN for extracting spatial and temporal features. The extracted features can either be introduced into a higher-level network for fusion, which is fine-tuned for the classification problem, or concatenated and used as extended input features for the classification.

State-of-the-art

For video analytics, two major lines of research have been proposed, namely frame-based and motion-based: in the former, features are extracted from individual frames, while in the latter, additional information regarding the inter-frame motion, such as optical flow [3], is also introduced. In terms of single-frame spatial feature extraction, CNNs have had a profound impact on image recognition, scene classification, and object detection, among others [16]. To account for the dynamic nature of video, a recently proposed concept involves extending the two-dimensional convolution to three dimensions, leading to 3D CNNs, where temporal information is included as a distinct input [12, 13]. An alternative approach for encoding the temporal information is through the use of Long Short-Term Memory (LSTM) networks [1, 13], while another concept involves the generation of dynamic images through the collapse of multiple video frames and the use of 2D deep feature extraction on such representations [7]. In [2], temporal information is encoded through average pooling of frame-based descriptors and their subsequent encoding in Fisher and VLAD vectors. In [4], the authors investigated deep video representations for action recognition, where temporal information was introduced in the frame-diff layer of the deep network architecture, through different temporal pooling strategies applied at the patch, frame, and temporal-window levels. One of the most successful frameworks for encoding both spatial and temporal information is the two-stream CNN [8]. Two-stream networks consider two sources of information, raw frames and optical flow, which are independently encoded by a CNN and fused via an SVM classifier.
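As an illustration of this two-stream idea, the sketch below extracts descriptors for both appearance (raw frames) and motion (optical flow rendered as images) from a pre-trained VGG-16, concatenates them per clip, and feeds the result to a kernel SVM. This is a simplified sketch under assumed input shapes and preprocessing, not the exact configuration of this work or of [8]; in particular, treating flow as 3-channel images and average-pooling 20 frames per clip are illustrative choices.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Pre-trained VGG-16 (torchvision >= 0.13 API); only the first fully connected
# layer is used, giving a 4096-dimensional descriptor per frame.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def frame_descriptor(batch):
    """batch: (N, 3, 224, 224) tensor of frames or 3-channel flow images."""
    with torch.no_grad():
        x = vgg.features(batch)
        x = vgg.avgpool(x).flatten(1)
        x = vgg.classifier[:2](x)   # first Linear + ReLU -> 4096-d descriptor
    return x

# Illustrative 20-frame clip: one descriptor per stream via average pooling.
rgb_clip  = torch.rand(20, 3, 224, 224)   # raw frames (assumed resized/normalized)
flow_clip = torch.rand(20, 3, 224, 224)   # optical flow rendered as 3-channel images

spatial  = frame_descriptor(rgb_clip).mean(dim=0)    # appearance stream
temporal = frame_descriptor(flow_clip).mean(dim=0)   # motion stream
fused    = torch.cat([spatial, temporal]).numpy()    # simple concatenation fusion

# A kernel SVM then separates goal from no-goal clips.
clf = SVC(kernel="rbf")
# clf.fit(train_features, train_labels)  # rows: per-clip fused descriptors
```

Concatenation is the simplest fusion strategy; the Autoencoder-based fusion discussed in the next section replaces it with a learned joint representation.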
Further studies on this framework demonstrated that using pre-trained models can have a dramatic impact on training time for the spatial and temporal features [22], while convolutional two-stream network fusion was recently applied to video action recognition [23]. The combination of 3D convolutions and the two-stream approach has also recently been reported for video classification, achieving state-of-the-art performance at significantly lower processing times [24]. The performance demonstrated by the two-stream approach for video analysis led to the choice of this paradigm in this work.

Event Detection Network

The proposed temporal event detection network is modeled as a two-stream deep network, coupled with a sparsity-regularized Autoencoder for the fusion of spatial and temporal data. We investigate Convolutional and Autoencoder Neural Networks for the extraction of spatial, temporal, and fused spatio-temporal features, and the subsequent application of kernel-based Support Vector Machines for the binary detection of goal events. A high-level overview of the processing pipeline is shown in Figure 1.

While in fully connected networks each hidden activation is computed by multiplying the entire input by the corresponding weights in that layer, in CNNs each hidden activation is computed by multiplying a small local input region against the weights. The typical structure of a CNN consists of a number of convolution and pooling/subsampling layers, optionally followed by fully connected layers. At each convolution layer, the outputs of the previous layer are convolved with learnable kernels and passed through the activation function to form this layer's output feature map. Let $n \times n$ be a square region extracted from a training input image $X \in \mathbb{R}^{N \times M}$, and let $w$ be a filter of kernel size $m \times m$. The output of the convolutional layer $h \in \mathbb{R}^{(n-m+1) \times (n-m+1)}$ is given by:

$$h_{ij} = \sigma\!\left(\sum_{a=0}^{m-1}\sum_{b=0}^{m-1} w_{ab}\, x_{(i+a)(j+b)} + b_{ij}\right), \qquad (1)$$

where $b$ is the additive bias term and $\sigma(\cdot)$ stands for the neuron's activation unit. Specifically, the activation function $\sigma$ is a standard way to model a neuron's output as a function of its input. Convenient choices for the activation function include the logistic sigmoid, the hyperbolic tangent, and the Rectified Linear Unit (ReLU). Considering the training time required by gradient descent, the saturating non-linearities (i.e., tanh and the logistic sigmoid) are much slower than the non-saturating ReLU. The output of the convolutional layer is directly used as input to a sub-sampling layer that produces downsampled versions of the input maps. There are several types of pooling, two common ones being max-pooling and average-pooling, which partition the input image into a set of non-overlapping or overlapping patches and output the maximum or average value for each such sub-region.

For the 2D feature extraction networks, we consider the VGG-16 CNN architecture, which is composed of 13 convolutional layers, five of them followed by a max-pooling layer, leading to three fully connected layers [9]. Unlike image detection problems, feature extraction in video must address the challenges associated with the temporal dimension.
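For reference, the computation in Eq. (1) can be written out directly as a short NumPy sketch. The input size, kernel values, scalar bias, and the choice of ReLU as $\sigma$ are illustrative assumptions; the loop form mirrors the equation rather than an optimized implementation.

```python
import numpy as np

def relu(z):
    """ReLU chosen here as sigma; tanh or a logistic sigmoid would also fit Eq. (1)."""
    return np.maximum(z, 0.0)

def conv_layer(x, w, b, sigma=relu):
    """Valid 2D convolution of Eq. (1): h_ij = sigma(sum_{a,b} w_ab * x_(i+a)(j+b) + bias).

    x: (n, n) input region, w: (m, m) kernel, b: scalar bias (one per feature map).
    Returns the (n-m+1, n-m+1) output feature map h.
    """
    n, m = x.shape[0], w.shape[0]
    h = np.empty((n - m + 1, n - m + 1))
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            h[i, j] = np.sum(w * x[i:i + m, j:j + m]) + b
    return sigma(h)

# Illustrative usage: an 8x8 input region and a 3x3 kernel yield a 6x6 feature map.
x = np.random.rand(8, 8)
w = np.random.randn(3, 3)
print(conv_layer(x, w, b=0.1).shape)  # (6, 6)
```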

References

[1] Mubarak Shah et al., "High-level event recognition in unconstrained videos," International Journal of Multimedia Information Retrieval, 2013.
[2] Geoffrey E. Hinton et al., "ImageNet classification with deep convolutional neural networks," Communications of the ACM, 2012.
[3] Stan Sclaroff et al., "Learning Activity Progression in LSTMs for Activity Detection and Early Detection," IEEE CVPR, 2016.
[4] Limin Wang et al., "Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice," Computer Vision and Image Understanding, 2014.
[5] Andrea Vedaldi et al., "MatConvNet: Convolutional Neural Networks for MATLAB," ACM Multimedia, 2014.
[6] Zhe Wang et al., "Towards Good Practices for Very Deep Two-Stream ConvNets," arXiv, 2015.
[7] Andrew Zisserman et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR, 2014.
[8] Shagan Sah et al., "Image description through fusion based recurrent multi-modal learning," IEEE ICIP, 2016.
[9] Brendan J. Frey et al., "k-Sparse Autoencoders," ICLR, 2013.
[10] Yi Yang et al., "A discriminative CNN video representation for event detection," IEEE CVPR, 2015.
[11] Qi Tian et al., "Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition," IEEE Transactions on Circuits and Systems for Video Technology, 2015.
[12] Andrew Zisserman et al., "Convolutional Two-Stream Network Fusion for Video Action Recognition," IEEE CVPR, 2016.
[13] Lorenzo Torresani et al., "Learning Spatiotemporal Features with 3D Convolutional Networks," IEEE ICCV, 2015.
[14] Trevor Darrell et al., "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," IEEE CVPR, 2014.
[15] Panagiotis Tsakalides et al., "Low Light Image Enhancement via Sparse Representations," ICIAR, 2014.
[16] Matthew J. Hausknecht et al., "Beyond short snippets: Deep networks for video classification," IEEE CVPR, 2015.
[17] Andrew Zisserman et al., "Two-Stream Convolutional Networks for Action Recognition in Videos," NIPS, 2014.
[18] Christopher Joseph Pal et al., "Describing Videos by Exploiting Temporal Structure," IEEE ICCV, 2015.
[19] Andreas E. Savakis et al., "Anomaly Detection in Video Using Predictive Convolutional Long Short-Term Memory Networks," arXiv, 2016.
[20] Apostol Natsev et al., "YouTube-8M: A Large-Scale Video Classification Benchmark," arXiv, 2016.
[21] Andrea Vedaldi et al., "Dynamic Image Networks for Action Recognition," IEEE CVPR, 2016.
[22] Michalis Zervakis et al., "Deep learning for multi-label land cover classification," SPIE Remote Sensing, 2015.
[23] Guigang Zhang et al., "Deep Learning," International Journal of Semantic Computing, 2016.
[24] Klaus-Robert Müller et al., "Efficient BackProp," Neural Networks: Tricks of the Trade, 2012.
[25] Luc Van Gool et al., "Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification," arXiv, 2016.