A multimodal corpus for integrated language and action

We describe a corpus, collected for the CAET (Cognitive Assistant for Everyday Tasks) project, for research on learning everyday tasks in natural environments by combining natural language description with rich sensor data. We recorded audio, video, Kinect RGB-Depth video, and RFID object-touch data while participants demonstrated how to make a cup of tea. The raw data are augmented with gold-standard annotations of the language representation and of the actions performed. In our approach, activity observation is augmented with natural language instruction to assist task learning.
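
As a rough illustration of how one recording session in such a corpus might be organized, the sketch below groups the synchronized sensor streams with their time-aligned annotations. The field names, tiers, and file layout are hypothetical assumptions for illustration only, not the actual CAET corpus format.

```python
# Hypothetical sketch of one session in a multimodal corpus.
# All field names, tiers, and the file layout are illustrative assumptions;
# they do not describe the actual CAET corpus format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Annotation:
    start: float   # seconds from session start
    end: float     # seconds from session start
    label: str     # e.g. an action label or a language-representation tag
    tier: str      # "action" or "language"

@dataclass
class Session:
    session_id: str
    audio_path: str                # speech recorded during the demonstration
    video_path: str                # conventional RGB video
    rgbd_path: str                 # Kinect RGB-Depth stream
    rfid_events: List[dict] = field(default_factory=list)   # object-touch events
    annotations: List[Annotation] = field(default_factory=list)

    def actions_between(self, t0: float, t1: float) -> List[Annotation]:
        """Return action annotations overlapping the interval [t0, t1]."""
        return [a for a in self.annotations
                if a.tier == "action" and a.start < t1 and a.end > t0]
```

A layout of this kind keeps each modality addressable on a common session timeline, which is the property the gold-standard annotations rely on when aligning spoken descriptions with the actions they describe.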