Video classification and retrieval through spatio-temporal Radon features

Abstract: The growing availability of video content on the Internet and on television has driven the development of automatic search procedures for retrieving a desired video. Automatic classification of videos can simplify and speed up such searches. This paper proposes a descriptor, the Spatio-Temporal Histogram of Radon Projections (STHRP), that represents the temporal pattern of a video's contents, and demonstrates its application to video classification and retrieval. The first step in computing the STHRP pattern is to represent a video as Three Orthogonal Planes (TOPs), i.e., XY, XT and YT, which capture its spatial and temporal contents. Frames corresponding to each plane are partitioned into overlapping blocks. Radon projections are obtained over these blocks at different orientations, yielding weighted transform coefficients that are normalized and grouped into bins. Linear Discriminant Analysis (LDA) is then applied to the coefficients of the TOPs to arrive at a compact description of the STHRP pattern. Compared to existing classification and retrieval approaches, the proposed descriptor is highly robust to translation, rotation and illumination variations in videos. To evaluate the invariant STHRP pattern, we conduct classification and retrieval experiments on the UCF-101, HMDB51, 10contexts and TRECVID data sets using a bagged tree model. In classification, the STHRP pattern achieves rates of 96.15%, 71.7%, 93.24% and 97.3% on the UCF-101, HMDB51, 10contexts and TRECVID 2005 data sets, respectively. In retrieval experiments on the TRECVID 2005, JHMDB and 10contexts data sets, the STHRP pattern returns videos relevant to the user's query in minimal time (0.05 s) with good precision.
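
The descriptor pipeline outlined in the abstract (orthogonal planes, overlapping blocks, Radon projections, normalized bins) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the use of a single central slice per plane, the block size, stride, angle set and bin count are all assumptions chosen for illustration, and the coefficient weighting and the LDA step mentioned in the abstract are omitted.

```python
import numpy as np

def orthogonal_planes(video):
    """Return central XY, XT and YT slices of a (T, H, W) grayscale video.

    Assumption: one central slice per plane; the paper may use many slices.
    """
    t, h, w = video.shape
    xy = video[t // 2]          # (H, W): spatial appearance
    xt = video[:, h // 2, :]    # (T, W): horizontal change over time
    yt = video[:, :, w // 2]    # (T, H): vertical change over time
    return xy, xt, yt

def radon_projection(block, theta_deg, n_bins):
    """Discrete Radon projection of a 2-D block at one orientation.

    Each pixel's intensity is accumulated into the bin of its signed
    distance from the block centre along the projection axis.
    """
    h, w = block.shape
    ys, xs = np.mgrid[0:h, 0:w]
    theta = np.deg2rad(theta_deg)
    s = (xs - (w - 1) / 2) * np.cos(theta) + (ys - (h - 1) / 2) * np.sin(theta)
    r = np.hypot(h, w) / 2      # half-diagonal bounds every |s|
    proj, _ = np.histogram(s, bins=n_bins, range=(-r, r), weights=block)
    return proj

def plane_descriptor(plane, block=16, stride=8,
                     angles=(0, 45, 90, 135), n_bins=8):
    """Concatenate L2-normalized projections over overlapping blocks.

    Assumption: the plane is at least `block` pixels in both dimensions.
    """
    h, w = plane.shape
    feats = []
    for y in range(0, h - block + 1, stride):       # stride < block => overlap
        for x in range(0, w - block + 1, stride):
            patch = plane[y:y + block, x:x + block].astype(float)
            for a in angles:
                p = radon_projection(patch, a, n_bins)
                feats.append(p / (np.linalg.norm(p) + 1e-8))
    return np.concatenate(feats)

def video_descriptor(video):
    """Stack the block descriptors of the three orthogonal planes."""
    return np.concatenate([plane_descriptor(p) for p in orthogonal_planes(video)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((16, 32, 32))   # synthetic clip: 16 frames of 32x32
    print(video_descriptor(clip).shape)
```

In a fuller pipeline, vectors like these would be reduced with LDA and passed to a bagged tree classifier, as the abstract describes; those stages are left out here to keep the sketch dependency-free.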
