Discovering audio-visual associations in narrated videos of human activities

This research presents a novel method for learning the lexical semantics of action verbs, focusing on actions directed toward objects, such as kicking a ball or pushing a chair. Specifically, this dissertation presents a robust and scalable method for acquiring grounded lexical semantics by discovering audio-visual associations in narrated videos. The narration accompanying the video contains many words, including verbs unrelated to the depicted action, and the actual name of the action is only occasionally mentioned by the narrator. More generally, this research presents an algorithm that can reliably and autonomously discover an association between two events, such as the utterance of a verb and the depiction of an action, even when the two events are only loosely correlated. Semantics is represented in a grounded way by association sets: collections of sensory inputs associated with a high-level concept. Each association set links video sequences depicting a given action with utterances of the action's name, and the sets are discovered in an unsupervised way. This dissertation also shows how to extract suitable features from the video and audio for this purpose. Extensive experimental results are presented, using several hours of video depicting a human performing 13 actions with 6 objects; the algorithm's performance was also tested on data provided by an external research group. The unsupervised learning algorithm is compared against standard supervised learning algorithms, and a number of relevant experimental parameters and new analysis techniques are introduced.
The experimental results show that the algorithm successfully discovers the correct associations between video scenes and audio utterances in an unsupervised way, despite the imperfect correlation between video and audio, and that it outperforms standard supervised learning algorithms. Among other things, this research shows that performance depends mainly on the strength of the correlation between video and audio, the length of the narration associated with each video scene, and the total number of words in the language.
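The core idea, discovering which word goes with which action even though the verb is uttered only occasionally, can be illustrated with a toy co-occurrence sketch. This is not the dissertation's actual algorithm; the corpus generator, word lists, and the lift-style scoring function are all hypothetical stand-ins, assuming the video side has already been grouped by action.

```python
import random
from collections import Counter, defaultdict

# Hypothetical setup: three actions and some distractor narration words.
ACTIONS = ["kick", "push", "lift"]
DISTRACTORS = ["the", "ball", "now", "watch", "chair", "see"]

def make_corpus(n_scenes=600, p_mention=0.4, seed=0):
    """Each scene depicts one action; its verb appears in the narration
    only with probability p_mention, buried among distractor words."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_scenes):
        action = rng.choice(ACTIONS)
        words = rng.sample(DISTRACTORS, 3)
        if rng.random() < p_mention:
            words.append(action)
        rng.shuffle(words)
        corpus.append((action, words))
    return corpus

def discover_associations(corpus):
    """Associate each action with the word whose probability is most
    elevated when that action is on screen (a lift / pointwise-mutual-
    information style score), with no labeled examples."""
    word_total = Counter()
    word_given_action = defaultdict(Counter)
    action_count = Counter()
    for action, words in corpus:
        action_count[action] += 1
        for w in set(words):
            word_total[w] += 1
            word_given_action[action][w] += 1
    n = sum(action_count.values())
    assoc = {}
    for action in action_count:
        def score(w):
            p_w_given_a = word_given_action[action][w] / action_count[action]
            p_w = word_total[w] / n
            return p_w_given_a / p_w  # how much the action boosts the word
        assoc[action] = max(word_given_action[action], key=score)
    return assoc

print(discover_associations(make_corpus()))
```

Distractor words co-occur with every action at roughly their base rate (lift near 1), while a verb only ever co-occurs with its own action, so even an occasional mention gives it the highest lift. This mirrors, in miniature, why loose correlation suffices for the discovery to succeed.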
