New colour fusion deep learning model for large-scale action recognition

In this work, we propose a fusion methodology that takes advantage of multiple deep convolutional neural network (CNN) models and two colour spaces, RGB and oRGB, to improve action recognition performance on still images. We trained our deep CNNs on both the RGB and oRGB colour spaces, extracted and fused the resulting features, and passed them to an SVM for classification. We evaluated the proposed fusion models on the Stanford 40 Actions dataset and the People Playing Musical Instruments (PPMI) dataset using two metrics: overall accuracy and mean average precision (mAP). Our results outperform the current state of the art, reaching 84.24% accuracy and 83.25% mAP on Stanford 40, and 65.94% accuracy and 65.85% mAP on PPMI. We also evaluated per-class performance on both datasets: the average precision of the top 20 classes on Stanford 40 ranges from 87% to 97%, while per-class average precision on PPMI ranges from 34% to 87%.
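As a rough illustration of the two-stream colour-fusion pipeline described above, the sketch below extracts features from two CNN backbones, one fed RGB images and one fed oRGB-converted images, concatenates the two feature vectors, and trains a linear SVM on the fused descriptors. The choice of backbone (torchvision's pretrained ResNet-50), the 224x224 input size, and the helper names are illustrative assumptions rather than the paper's exact setup, and the oRGB helper implements only the linear L/C1/C2 step, omitting the chroma-plane rotation of the full oRGB transform.

```python
# Illustrative sketch only: hypothetical two-stream RGB + oRGB feature fusion
# followed by a linear SVM, assuming torchvision's pretrained ResNet-50 as the
# backbone and scikit-learn's LinearSVC as the classifier.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF
from sklearn.svm import LinearSVC


def rgb_to_orgb_linear(img):
    """Linear L/C1/C2 step of the oRGB transform (Bratkova, Boulos, Shirley).
    The full oRGB space additionally rotates the (C1, C2) chroma plane; that
    rotation is omitted here for brevity."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    l = 0.299 * r + 0.587 * g + 0.114 * b
    c1 = 0.5 * r + 0.5 * g - b
    c2 = 0.5 * r - 0.5 * g
    return np.stack([l, c1, c2], axis=-1)


def make_extractor():
    """Pretrained CNN with the classification head removed, used as a fixed
    feature extractor (2048-D output for ResNet-50)."""
    net = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    net.fc = torch.nn.Identity()  # drop the final classifier layer
    return net.eval()


@torch.no_grad()
def extract_features(model, images, to_orgb=False):
    """images: list of HxWx3 float arrays in [0, 1]. Returns an N x 2048 matrix."""
    feats = []
    for img in images:
        x = rgb_to_orgb_linear(img) if to_orgb else img
        x = torch.from_numpy(x.astype(np.float32)).permute(2, 0, 1)  # C, H, W
        x = TF.resize(x, [224, 224], antialias=True)
        feats.append(model(x.unsqueeze(0)).squeeze(0).numpy())
    return np.stack(feats)


def fit_fusion_svm(train_images, train_labels):
    """Feature-level fusion: concatenate the RGB-stream and oRGB-stream CNN
    features, then train a linear SVM on the fused descriptors."""
    rgb_net, orgb_net = make_extractor(), make_extractor()
    f_rgb = extract_features(rgb_net, train_images)                  # RGB stream
    f_orgb = extract_features(orgb_net, train_images, to_orgb=True)  # oRGB stream
    fused = np.concatenate([f_rgb, f_orgb], axis=1)
    return LinearSVC().fit(fused, train_labels)
```

In this kind of setup, the fitted SVM would then be scored on held-out images with overall accuracy and per-class average precision, matching the two metrics reported above; the paper's actual models may fuse more than one CNN per colour stream.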
