Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions

We propose a method for describing human activities in video images based on a concept hierarchy of actions. A major difficulty in transforming video images into textual descriptions is bridging the semantic gap between them, a problem also known as the inverse Hollywood problem. In general, the concepts of human events and actions can be classified by semantic primitives. By associating these concepts with semantic features extracted from the video images, appropriate syntactic components such as verbs and objects are determined and then translated into natural language sentences. We demonstrate the performance of the proposed method through several experiments.
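To make the pipeline concrete, the following is a minimal sketch of the idea described above: semantic features extracted from video are matched against a small concept hierarchy of actions, the most specific matching concept supplies the verb, and a simple case frame is filled to produce a sentence. Everything here (the hierarchy, feature names, and functions such as describe) is hypothetical illustration, not the paper's actual representation or code.

```python
# Illustrative sketch only: toy pipeline from extracted semantic features
# to a sentence via a concept hierarchy of actions. All names are
# hypothetical and stand in for the paper's richer concept taxonomy.

# Each action concept lists the semantic primitives (feature constraints)
# it requires; a child concept is more specific than its parent.
ACTION_HIERARCHY = {
    "move":     {"requires": {"motion": True}, "verb": "moves"},
    "walk":     {"parent": "move",
                 "requires": {"motion": True, "posture": "upright"},
                 "verb": "walks"},
    "approach": {"parent": "move",
                 "requires": {"motion": True, "distance_to_object": "decreasing"},
                 "verb": "approaches"},
}

def matches(concept, features):
    """A concept applies when all of its required primitives hold."""
    return all(features.get(k) == v for k, v in concept["requires"].items())

def most_specific_action(features):
    """Pick the matching concept with the most primitives (deepest match)."""
    candidates = [c for c in ACTION_HIERARCHY.values() if matches(c, features)]
    return max(candidates, key=lambda c: len(c["requires"]), default=None)

def describe(subject, features, obj=None):
    """Fill a simple case frame: subject + verb (+ object)."""
    concept = most_specific_action(features)
    if concept is None:
        return f"{subject} is present."
    parts = [subject, concept["verb"]]
    if obj:
        parts.append(obj)
    return " ".join(parts) + "."

# Hypothetical features extracted from a video frame sequence.
features = {"motion": True, "distance_to_object": "decreasing"}
print(describe("The person", features, "the door"))
# -> "The person approaches the door."
```

The key design point the sketch mirrors is that sentence generation reduces to a search over the hierarchy: more specific concepts (more satisfied primitives) yield more informative verbs, falling back to general ones when the extracted features underdetermine the action.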