Learning discriminative action and context representations for action recognition in still images

Action recognition in still images is a challenging task in computer vision. Recent successes in deep feature-learning advance this research, employing robust and rich-semantic feature representation. However, the issue that recognition fails when two action images share similar contexts is long-standing. In this paper, we employ metric learning method to address within-class and between-class confusions in action recognition. We propose a novel loss function, named composite-triplet loss. Supervised by this loss function, our method directly learns a similarity function from data. Employing a customized human action and context detection network, we obtain highly discriminative action image embeddings, which can be used in action image recognition and other tasks. Our approach is evaluated on the still-image action recognition task and the image caption generation task. On the PASCAL VOC dataset, our approach outperforms state-of-the-art methods, achieving 90.6% mean AP.

[1]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jianfei Cai,et al.  Action Recognition in Still Images With Minimum Annotation Efforts , 2016, IEEE Transactions on Image Processing.

[3]  Cordelia Schmid,et al.  Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Jitendra Malik,et al.  Actions and Attributes from Wholes and Parts , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[6]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Fahad Shahbaz Khan,et al.  Recognizing Actions Through Action-Specific Person Detection , 2015, IEEE Transactions on Image Processing.

[8]  Minh Hoai,et al.  Regularized Max Pooling for Image Categorization , 2014, BMVC.

[9]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[10]  Guodong Guo,et al.  A survey on still image based human action recognition , 2014, Pattern Recognit..

[11]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[12]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[13]  Fei-Fei Li,et al.  Combining randomization and discrimination for fine-grained image categorization , 2011, CVPR 2011.

[14]  Kilian Q. Weinberger,et al.  Distance Metric Learning for Large Margin Nearest Neighbor Classification , 2005, NIPS.

[15]  Xiaogang Wang,et al.  Deep Learning Face Representation by Joint Identification-Verification , 2014, NIPS.

[16]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[17]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Chen Huang,et al.  Human Attribute Recognition by Deep Hierarchical Contexts , 2016, ECCV.

[19]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[20]  Shengcai Liao,et al.  Person re-identification by Local Maximal Occurrence representation and metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).