Factor exploration of gestural stroke choice in the context of ambiguous instruction utterances: challenges to synthesizing semantic gesture from speech alone

Current models of gesture synthesis rely primarily on the speech signal to synthesize gestures. In this paper, we take a critical look at this approach from the point of view of gesture’s tendency to disambiguate the verbal component of an expression. We identify and contribute an analysis of three challenge factors for these models: 1) synthesizing gesture in the presence of ambiguous utterances appears to be an overwhelmingly useful case for gesture production, yet it is not supported by present-day models of gesture generation; 2) finding the best f-formation to convey spatial gestural information, such as gesturing directions, makes a significant difference for everyday users and must be taken into account; and 3) assuming that captured human motion is a plentiful and easy source for retargeting gestural motion may not account for the readability of gestures under kinematically constrained feasibility spaces.

Recent approaches to generating gesture for agents [1] and robots [2] treat gesture as co-speech that is strictly dependent on the verbal utterance. Evidence suggests that gesture selection may leverage task context, so it is not dependent on the verbal utterance alone. This effect is particularly evident when attempting to generate gestures from ambiguous verbal utterances (e.g., "You do this when you get to the fork in the road"). Decoupling this strict dependency may allow gesture to be synthesized for the purpose of clarifying the ambiguous verbal utterance.
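To make the decoupling argument concrete, the following is a minimal sketch (not the paper's method; all names such as TaskContext, utterance_is_ambiguous, and select_gesture are hypothetical) of a gesture selector conditioned on task context in addition to the speech transcript, so that an ambiguous utterance can trigger a clarifying deictic or iconic gesture rather than a purely speech-driven beat gesture.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Deictic words that typically leave the referent underspecified without gesture.
AMBIGUOUS_MARKERS = {"this", "that", "there", "here"}


@dataclass
class TaskContext:
    # (x, y, z) of the object or direction being referenced, if known.
    referent_position: Optional[Tuple[float, float, float]]
    # Listener position, which could inform the choice of f-formation.
    listener_position: Optional[Tuple[float, float, float]]


def utterance_is_ambiguous(transcript: str) -> bool:
    """Crude lexical test: does the utterance rely on deixis the listener cannot resolve?"""
    tokens = (tok.strip('.,!?"').lower() for tok in transcript.split())
    return any(tok in AMBIGUOUS_MARKERS for tok in tokens)


def select_gesture(transcript: str, context: TaskContext) -> str:
    """Choose a gesture category from speech AND task context, not speech alone."""
    if utterance_is_ambiguous(transcript) and context.referent_position is not None:
        # The verbal channel underspecifies the referent; let gesture disambiguate.
        return "deictic_point_at_referent"
    if utterance_is_ambiguous(transcript):
        # Ambiguity with no resolvable referent: an iconic demonstration can carry
        # the missing manner-of-action information ("do this").
        return "iconic_demonstration"
    # Unambiguous speech: fall back to prosody-driven beat gestures.
    return "beat"


if __name__ == "__main__":
    ctx = TaskContext(referent_position=(1.2, 0.0, 0.4), listener_position=(0.0, 1.5, 0.0))
    print(select_gesture("You do this when you get to the fork in the road", ctx))
```

The point of the sketch is the conditioning signature: select_gesture takes both the transcript and a task context, whereas a strictly co-speech model would take the transcript alone.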