论文信息 - HICO: A Benchmark for Recognizing Human-Object Interactions in Images

HICO: A Benchmark for Recognizing Human-Object Interactions in Images

We introduce a new benchmark "Humans Interacting with Common Objects" (HICO) for recognizing human-object interactions (HOI). We demonstrate the key features of HICO: a diverse set of interactions with common object categories, a list of well-defined, sense-based HOI categories, and an exhaustive labeling of co-occurring interactions with an object category in each image. We perform an in-depth analysis of representative current approaches and show that DNNs enjoy a significant edge. In addition, we show that semantic knowledge can significantly improve HOI recognition, especially for uncommon categories.

[1] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2] Xue Li,et al. Action recognition in still images using a combination of human pose and context information , 2012, 2012 19th IEEE International Conference on Image Processing.

[3] Chih-Jen Lin,et al. LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[4] Václav Hlavác,et al. Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5] Nazli Ikizler-Cinbis,et al. Learning actions from the Web , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[6] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[7] Jian-Huang Lai,et al. Recognising Human-Object Interaction via Exemplar Based Modelling , 2013, 2013 IEEE International Conference on Computer Vision.

[8] Charless C. Fowlkes,et al. Discriminative models for static human-object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[9] Fei-Fei Li,et al. Combining randomization and discrimination for fine-grained image categorization , 2011, CVPR 2011.

[10] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[11] Thomas Serre,et al. HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[12] Fei-Fei Li,et al. Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13] Luc Van Gool,et al. The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[14] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[15] Larry S. Davis,et al. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] Bernt Schiele,et al. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[18] Yang Wang,et al. Unsupervised Discovery of Action Classes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[19] Cordelia Schmid,et al. Expanded Parts Model for Human Attribute and Action Recognition in Still Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[20] Thomas Mensink,et al. Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[21] Cordelia Schmid,et al. Weakly Supervised Learning of Interactions between Humans and Objects , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22] Fei-Fei Li,et al. Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23] Fei-Fei Li,et al. What, where and who? Classifying events by scene and object recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24] Samy Bengio,et al. Learning semantic relationships for better action retrieval in images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Vicente Ordonez,et al. Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[26] Raffaella Bernardi,et al. TUHOI: Trento Universal Human Object Interaction Dataset , 2014, VL@COLING.

[27] Slav Petrov,et al. Syntactic Annotations for the Google Books NGram Corpus , 2012, ACL.

[28] Subhransu Maji,et al. Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[29] Ivan Laptev,et al. Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[30] Alan L. Yuille,et al. Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[31] Andrew Y. Ng,et al. Parsing with Compositional Vector Grammars , 2013, ACL.

[32] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[33] Raffaella Bernardi,et al. Exploiting language models to recognize unseen actions , 2013, ICMR '13.

[34] Yang Wang,et al. Recognizing human actions from still images with latent poses , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35] Guodong Guo,et al. A survey on still image based human action recognition , 2014, Pattern Recognit..

[36] Michael Felsberg,et al. Coloring Action Recognition in Still Images , 2013, International Journal of Computer Vision.

[37] Pinar Duygulu Sahin,et al. Recognizing actions from still images , 2008, 2008 19th International Conference on Pattern Recognition.

[38] Leonidas J. Guibas,et al. Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[39] Cees Snoek,et al. COSTA: Co-Occurrence Statistics for Zero-Shot Classification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40] Ivan Laptev,et al. Learning person-object interactions for action recognition in still images , 2011, NIPS.