Action Categorisation in Multimodal Instructions

Situated agents interact both with the physical environment in which they are located and with their conversational partners. Because both the world and the language used in situated conversations change continuously, an agent must be able to adapt its grounded semantic representations by learning from new information. A prerequisite for a dynamic, interactive approach to learning grounded semantic representations is that the agent is equipped with a set of actions that define its strategies for identifying linguistic and perceptual information and connecting it to its knowledge. In this talk we present our work on grounding spatial descriptions, which argues that perceptual grounding is dynamic and adaptable to context. We describe a system called Kille, which we use for interactive learning of objects and spatial relations from a human tutor. Finally, we describe our work on identifying interactive strategies of frame-of-reference assignment in spatial descriptions in a corpus of human-human dialogues. We argue that there is no general preference for frame-of-reference assignment; rather, the assignment is linked to the interaction strategies that agents adopt within a particular dialogue game.
