Unsupervised Linking of Visual Features to Textual Descriptions in Long Manipulation Activities

We present a novel unsupervised framework that links continuous visual features and symbolic textual descriptions of manipulation activity videos. First, we extract the semantic representation of visually observed manipulations by applying a bottom-up approach to the continuous image streams. We then employ rule-based reasoning to link visual and linguistic inputs. The proposed framework allows robots 1) to autonomously parse, classify, and label sequentially and/or concurrently performed atomic manipulations (e.g., “cutting” or “stirring”), 2) to simultaneously categorize and identify manipulated objects without using any standard feature-based recognition techniques, and 3) to generate textual descriptions for long activities, e.g., “breakfast preparation.” We evaluated the framework using a dataset of 120 atomic manipulations and 20 long activities.
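As a rough illustration of the rule-based linking and description step, the sketch below maps parsed atomic manipulations (a verb plus the manipulated objects) to template sentences, one per manipulation in temporal order. This is a minimal sketch under stated assumptions: the class `AtomicManipulation`, the `TEMPLATES` table, and the `describe` function are hypothetical names introduced here for illustration and do not reproduce the authors' actual rule set or API.

```python
# Hypothetical sketch of rule-based vision-to-language linking.
# All names and the verb templates are illustrative assumptions,
# not the paper's actual implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class AtomicManipulation:
    """One atomic manipulation parsed from the visual stream."""
    verb: str           # e.g. "cutting", "stirring"
    tool: str           # object held in the hand, e.g. "knife"
    target: str         # object acted upon, e.g. "bread"
    start_frame: int
    end_frame: int


# Simple verb templates: each verb maps its visual roles
# (tool, target) onto a sentence pattern.
TEMPLATES = {
    "cutting":  "The hand cuts the {target} with the {tool}.",
    "stirring": "The hand stirs the {target} with the {tool}.",
}


def describe(activity: List[AtomicManipulation]) -> List[str]:
    """Generate one sentence per atomic manipulation, in temporal order."""
    sentences = []
    for m in sorted(activity, key=lambda m: m.start_frame):
        template = TEMPLATES.get(m.verb, "The hand manipulates the {target}.")
        sentences.append(template.format(tool=m.tool, target=m.target))
    return sentences


if __name__ == "__main__":
    # Toy "breakfast preparation" activity built from two atomic manipulations.
    breakfast = [
        AtomicManipulation("cutting", "knife", "bread", 0, 120),
        AtomicManipulation("stirring", "spoon", "coffee", 121, 260),
    ]
    for sentence in describe(breakfast):
        print(sentence)
```

The design choice in this sketch mirrors the high-level idea of the abstract: once atomic manipulations and their objects are available from the visual parse, sentence generation reduces to filling role slots in per-verb templates, without any feature-based object recognizer.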
