Towards Automatic Understanding of ‘Virtual Pointing’ in Interaction

When trying to convey, from memory, the placement of objects relative to each other, one can use descriptions such as “the one is about two centimeters to the left of the other, and roughly one centimeter higher”, or one can simply place one's hands in a representation of this configuration and say something like “one is here and the other one is here”. The type of gesture used in these latter displays has been called “abstract deixis” (McNeill et al., 1993) or “virtual pointing” (Kibrik, 2011). It has been observed that these gestures have the remarkable effect of creating extralinguistic spatial referents for objects that are mentioned in the discourse but are not in fact currently present. These referents can be used later in the discourse to refer to the same entity again; in our example, this could be done via “and this one [accompanied by pointing gesture] is”. Lascarides and Stone (2009) make the interesting proposal that such gestures do indeed call attention to a real location in shared space (which they denote with variables such as $\vec{p}$), but carry their semantic load via a mapping $v$ into the conveyed location $v(\vec{p})$ in the described situation, where the identity of the mapping is contextually determined. Configurations of locations indicated via such gestures (e.g. $\vec{p}_1$ and $\vec{p}_2$) then achieve their iconic value as a depiction of the configuration of the locations they are mapped into ($v(\vec{p}_1)$, $v(\vec{p}_2)$). We were interested in how stable over time and how precise in their iconicity such mappings are in actual instances of use, with a view to how automatic understanding of such speech/gesture ensembles could be realized. We elicited and recorded multimodal spatial scene descriptions, and measured precision by fitting a mapping between virtual referent locations and true object locations. We then used this mapping to retrieve, from the set of all scenes, the one that was being described.
Using our matching method, we find that the gestures carry a good amount of spatial information for 45 out of 53 episodes. In current work, we are attempting to make this retrieval process incremental and to combine it with an understanding of the utterance that the gestures accompany.
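
To make the fit-then-retrieve idea concrete, the following is a minimal sketch and not the procedure used in the study. It rests on several assumptions that the abstract does not state: locations are two-dimensional, the mapping v is a similarity transform (rotation, uniform scaling and translation), each gestured location is already paired with the object it stands for, and a scene is scored by its fit residual normalized by the spread of the scene. The function names and the toy coordinates are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def fit_similarity(gesture: Sequence[Point],
                   scene: Sequence[Point]) -> Tuple[complex, complex, float]:
    """Least-squares fit of w ~ a*z + b, with points encoded as complex numbers.

    ASSUMPTION: v is a 2D similarity transform, and gesture[i] corresponds
    to scene[i]. Returns (a, b, error), where error is the root of the
    squared residual divided by the squared spread of the scene points.
    """
    if len(gesture) != len(scene) or len(gesture) < 3:
        # Two points can always be matched exactly, so they carry no evidence.
        raise ValueError("need at least three paired locations")
    z = [complex(x, y) for x, y in gesture]
    w = [complex(x, y) for x, y in scene]
    n = len(z)
    z_mean = sum(z) / n
    w_mean = sum(w) / n
    zc = [p - z_mean for p in z]
    wc = [q - w_mean for q in w]
    z_spread = sum(abs(p) ** 2 for p in zc)
    w_spread = sum(abs(q) ** 2 for q in wc)
    if z_spread == 0 or w_spread == 0:
        raise ValueError("all points of a configuration coincide")
    # Closed-form optimum: a encodes rotation and scale, b the translation.
    a = sum(p.conjugate() * q for p, q in zip(zc, wc)) / z_spread
    b = w_mean - a * z_mean
    residual = sum(abs(a * p + b - q) ** 2 for p, q in zip(z, w))
    return a, b, (residual / w_spread) ** 0.5


def retrieve_scene(gesture: Sequence[Point],
                   scenes: Dict[str, Sequence[Point]]) -> List[Tuple[str, float]]:
    """Rank candidate scenes by normalized fit error, best match first."""
    ranked = [(name, fit_similarity(gesture, pts)[2])
              for name, pts in scenes.items()]
    return sorted(ranked, key=lambda item: item[1])


# Toy data (hypothetical): the gesture is scene "A" rotated by 90 degrees,
# shrunk by a factor of ten and shifted.
scenes = {
    "A": [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)],
    "B": [(0.0, 0.0), (1.0, 5.0), (6.0, 1.0)],
}
gesture = [(2.0, 1.0), (2.0, 1.4), (1.7, 1.4)]
ranking = retrieve_scene(gesture, scenes)
```

The normalization by scene spread is a design choice of this sketch: without it, a scene whose objects lie close together would receive a small residual for any gesture, because the fit can shrink the gesture configuration towards the scene's centroid.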