Head Movement Synthesis Based on Semantic and Prosodic Features for a Chinese Expressive Avatar

This paper proposes an approach to text-to-visual speech synthesis in which synthetic head movements are rendered by an expressive talking avatar speaking Cantonese. The input text consists of descriptive information from the Hong Kong tourism domain. The text is segmented into prosodic words (PWs), and we adopt the PAD (Pleasure-Arousal-Dominance) model to describe the expressivity of each prosodic word based on its semantics. Within each PW, we consider two prosodic features relevant to head movement synthesis, namely the stress and the tone of each Chinese syllable. We designed and recorded an audiovisual speech corpus and analyzed the data to derive statistical correspondences between the (P, A) values of a Chinese prosodic word and head movement coordinates. These statistics guide parameter selection for a sinusoidal head movement model. Corpus analysis also enables us to locate "peak points" of head movement that are synchronized with prosodic features within a prosodic word. These peak points inform the design of three heuristics that control head movements within a PW. Perceptual evaluation with the expressive talking avatar shows that head movement synthesis raises the mean opinion score (MOS) by 1.04 points on average, compared with a baseline that shows only lip articulation without head movements.
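
To make the parameter-selection step concrete, the following is a minimal sketch of how a sinusoidal head-movement trajectory could draw its amplitude and frequency from (P, A)-indexed corpus statistics, as the abstract describes. The lookup-table values, function names, sampling rate, and exact parameterization (amplitude, frequency, phase) are illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch: a sinusoidal head-movement model whose parameters are
# selected from hypothetical (P, A)-indexed corpus statistics. All numbers and
# names below are assumptions for illustration only.
import math

# Hypothetical lookup: mean rotation amplitude (degrees) and oscillation
# frequency (Hz) derived from corpus analysis, keyed by quantized (P, A).
PA_STATS = {
    (1, 1):   {"amplitude": 4.0, "frequency": 1.2},  # e.g. positive, aroused
    (1, -1):  {"amplitude": 2.5, "frequency": 0.8},  # e.g. positive, calm
    (-1, 1):  {"amplitude": 5.0, "frequency": 1.5},  # e.g. negative, aroused
    (-1, -1): {"amplitude": 1.5, "frequency": 0.6},  # e.g. negative, calm
}

def head_rotation(t: float, p: int, a: int, phase: float = 0.0) -> float:
    """Head rotation angle (degrees) at time t (seconds) for a prosodic
    word with quantized pleasure p and arousal a."""
    stats = PA_STATS[(p, a)]
    return stats["amplitude"] * math.sin(
        2 * math.pi * stats["frequency"] * t + phase
    )

# Sample a trajectory over a 0.5 s prosodic word at 25 frames per second.
trajectory = [head_rotation(i / 25.0, p=1, a=1) for i in range(13)]
```

In such a scheme, the phase offset could be chosen so that the sinusoid's extremum coincides with a detected "peak point" (e.g. a stressed syllable) inside the prosodic word, which is one plausible way the heuristics mentioned above could synchronize head motion with prosody.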