Unsupervised Linking of Visual Features to Textual Descriptions in Long Manipulation Activities

We present a novel unsupervised framework that links continuous visual features and symbolic textual descriptions of manipulation activity videos. First, we extract the semantic representation of visually observed manipulations by applying a bottom-up approach to the continuous image streams. We then employ rule-based reasoning to link visual and linguistic inputs. The proposed framework allows robots 1) to autonomously parse, classify, and label sequentially and/or concurrently performed atomic manipulations (e.g., “cutting” or “stirring”), 2) to simultaneously categorize and identify manipulated objects without using any standard feature-based recognition techniques, and 3) to generate textual descriptions for long activities, e.g., “breakfast preparation.” We evaluated the framework using a dataset of 120 atomic manipulations and 20 long activities.
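As a rough illustration of the rule-based linking and description step, the sketch below maps parsed atomic manipulations (a verb plus the manipulated objects) to template sentences, one per manipulation in temporal order. This is a minimal sketch under stated assumptions: the class `AtomicManipulation`, the `TEMPLATES` table, and the `describe` function are hypothetical names introduced here for illustration and do not reproduce the authors' actual rule set or API.

```python
# Hypothetical sketch of rule-based vision-to-language linking.
# All names and the verb templates are illustrative assumptions,
# not the paper's actual implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class AtomicManipulation:
    """One atomic manipulation parsed from the visual stream."""
    verb: str           # e.g. "cutting", "stirring"
    tool: str           # object held in the hand, e.g. "knife"
    target: str         # object acted upon, e.g. "bread"
    start_frame: int
    end_frame: int


# Simple verb templates: each verb maps its visual roles
# (tool, target) onto a sentence pattern.
TEMPLATES = {
    "cutting":  "The hand cuts the {target} with the {tool}.",
    "stirring": "The hand stirs the {target} with the {tool}.",
}


def describe(activity: List[AtomicManipulation]) -> List[str]:
    """Generate one sentence per atomic manipulation, in temporal order."""
    sentences = []
    for m in sorted(activity, key=lambda m: m.start_frame):
        template = TEMPLATES.get(m.verb, "The hand manipulates the {target}.")
        sentences.append(template.format(tool=m.tool, target=m.target))
    return sentences


if __name__ == "__main__":
    # Toy "breakfast preparation" activity built from two atomic manipulations.
    breakfast = [
        AtomicManipulation("cutting", "knife", "bread", 0, 120),
        AtomicManipulation("stirring", "spoon", "coffee", 121, 260),
    ]
    for sentence in describe(breakfast):
        print(sentence)
```

The design choice in this sketch mirrors the high-level idea of the abstract: once atomic manipulations and their objects are available from the visual parse, sentence generation reduces to filling role slots in per-verb templates, without any feature-based object recognizer.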
