Deciphering the communicative code in speech and gesture dialogues by autoencoding hypernetworks

What kinds of grammar or code are used in interactive communication with speech and gestures? How varied or invariant is this code among the people of a language community? What types of communicative code facilitate the alignment of speech and gesture for language understanding? To study these and related questions, we develop computational techniques based on coding theory and machine learning that decipher the communicative code in embodied multimodal interaction. We use data from the SaGA (Bielefeld Speech and Gesture Alignment) corpus, which consists of 25 dyads engaged in a spatial communication task, with naturalistic yet controlled and systematically annotated speech and gesture use (Luecking, 2010). For the work presented here, a sub-corpus of 5 dyads is employed (473 noun phrases, 288 gestures), combining three kinds of information: first, a gesture coding comprising gestural representation techniques (e.g., drawing, placing) and morphological gesture features (e.g., handshape); second, a transcription of the spoken words together with dialogue-contextual information (information state, thematization, elemental actions of direction giving); and third, a coding of the gestures' referent objects and their spatio-geometrical properties (dimensionality, symmetries, etc.).

We formulate the gesture generation problem as an encoding problem and use an unsupervised autoencoding technique, in which the input vector x is transformed by a parameterized function f(·; W) into an output vector y that reproduces the input, i.e., y = f(x; W) = x. For the transformation we use a hypernetwork graphical architecture. A hypernetwork is a hypergraph structure whose hyperedges are weighted and represent subsets of the variables. One advantage of the hypernetwork is that it can capture compositional structures or code words (akin to construction grammar rules) in its hypergraph structure. We apply an expectation-maximization (EM) style learning algorithm to build the autoencoding hypernetwork that best fits the observed gesture-speech dialogue data.

Another advantage of the hypernetwork is its generativity: the model can generate the values of unknown (unobserved) variables from those of known (observed) variables by probabilistic inference. This feature is especially useful for artificial communicative agents, since the learned hypernetwork can be used to synthesize gestures for virtual avatars or humanoid robots.
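To make the representation concrete, here is a minimal Python sketch of a hypernetwork over discrete annotation features, assuming each observation is a dictionary of (variable, value) bindings. The class name `Hypernetwork`, the fixed hyperedge order, and the random sampling of hyperedges per observation are illustrative assumptions rather than the exact implementation used in this work.

```python
import random
from collections import defaultdict

class Hypernetwork:
    """Weighted hypergraph over discrete (variable, value) bindings.

    Each hyperedge is a small set of bindings and plays the role of a
    candidate 'code word' coupling speech, context, and gesture features.
    """

    def __init__(self, order=3):
        self.order = order               # hyperedge cardinality k (assumed fixed)
        self.edges = defaultdict(float)  # frozenset of bindings -> weight

    def add_example(self, example, n_samples=20):
        """Randomly sample order-k hyperedges from one annotated
        observation, e.g. {'handshape': 'flat',
        'rep_technique': 'placing', 'referent_dim': '2D'}."""
        bindings = list(example.items())
        for _ in range(n_samples):
            edge = frozenset(random.sample(bindings, self.order))
            self.edges[edge] += 1.0      # initialize weight by sample count
```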
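The autoencoding criterion y = f(x; W) = x can then be read as a reconstruction requirement: every variable of an observation should be recoverable from the remaining ones. Building on the class above, the following sketch approximates the EM-style learning loop; the weighted-voting reconstruction (E-step) is a common hypernetwork inference scheme, while the multiplicative reweighting (M-step) and the learning rate `lr` are assumptions for illustration.

```python
from collections import defaultdict

def reconstruct(hn, example, hidden):
    """Autoencoding step y = f(x; W): fill in the `hidden` variables of
    `example` by weighted voting among hyperedges whose visible
    bindings all match the observation."""
    visible = {(k, v) for k, v in example.items() if k not in hidden}
    votes = {h: defaultdict(float) for h in hidden}
    for edge, w in hn.edges.items():
        if all(b in visible for b in edge if b[0] not in hidden):
            for k, v in edge:
                if k in hidden:
                    votes[k][v] += w
    return {k: (max(vs, key=vs.get) if vs else None)
            for k, vs in votes.items()}

def em_step(hn, data, lr=0.1):
    """One EM-style sweep: hold out each variable in turn and
    reconstruct it (E-step), then reweight the hyperedges that voted
    (M-step), rewarding true bindings and penalizing wrong ones."""
    for example in data:
        obs = set(example.items())
        for target in example:
            guess = reconstruct(hn, example, {target})[target]
            for edge, w in hn.edges.items():
                rest = {b for b in edge if b[0] != target}
                if not rest <= obs:
                    continue  # edge's context does not match this example
                if (target, example[target]) in edge:
                    hn.edges[edge] = w * (1.0 + lr)  # supported the truth
                elif guess != example[target] and (target, guess) in edge:
                    hn.edges[edge] = w * (1.0 - lr)  # supported an error
```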
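Generativity then amounts to running the same voting inference with the gesture variables treated as unobserved. A hypothetical usage, with invented feature names and values standing in for the SaGA annotations:

```python
# Builds on Hypernetwork, reconstruct, and em_step defined above.
hn = Hypernetwork(order=3)
data = [
    # invented stand-ins for the annotated NP/gesture feature dicts
    {'info_state': 'new', 'referent_dim': '2D',
     'rep_technique': 'drawing', 'handshape': 'G'},
    {'info_state': 'given', 'referent_dim': '3D',
     'rep_technique': 'placing', 'handshape': 'C'},
]
for ex in data:
    hn.add_example(ex, n_samples=10)
for _ in range(10):
    em_step(hn, data)

# Synthesis: observe only speech/context features, infer the gesture.
speech_context = {'info_state': 'new', 'referent_dim': '2D'}
gesture = reconstruct(hn, speech_context,
                      hidden={'rep_technique', 'handshape'})
print(gesture)  # e.g. {'rep_technique': 'drawing', 'handshape': 'G'}
```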