Recognizing Manipulation Actions from State-Transformations

Manipulation actions transform objects from an initial state into a final state. In this paper, we report on the use of object state transitions as a means for recognizing manipulation actions. Our method is inspired by the intuition that object states are visually more apparent than actions and thus provide information that is complementary to spatiotemporal action recognition. We start by defining a state transition matrix that maps each action verb to a pre-state and a post-state. We extract keyframes at regular intervals from the video sequence and use them to recognize objects and object states. Changes in object state are then used to predict action verbs. We report results on the EPIC-KITCHENS action recognition challenge.
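To make the idea concrete, the following minimal Python sketch illustrates how a state transition table of this kind could map an observed change in object state to an action verb. The state labels, table entries, and function names are hypothetical illustrations under assumed conventions, not the paper's actual implementation.

from typing import Optional

# Minimal sketch (hypothetical states and entries): a state transition table
# mapping (pre-state, post-state) pairs to action verbs.
TRANSITIONS = {
    ("whole", "cut"): "cut",
    ("closed", "open"): "open",
    ("open", "closed"): "close",
    ("raw", "cooked"): "cook",
    ("empty", "full"): "fill",
}

def predict_verb(pre_state: str, post_state: str) -> Optional[str]:
    """Return the action verb implied by an observed object-state change, if any."""
    if pre_state == post_state:
        return None  # no visible state change between the two keyframes
    return TRANSITIONS.get((pre_state, post_state))

# Example: per-keyframe state classifier outputs for the same object.
print(predict_verb("closed", "open"))  # -> open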
