Predicting Head Pose in Dyadic Conversation

Natural head movement plays a significant role in realistic speech animation. Numerous studies have demonstrated how much visual cues contribute to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek features in the speech mode from which to predict head pose. Several previous authors have shown that such prediction is possible, but their experiments are typically confined to rigidly produced dialogue.
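The core idea of mapping speech features to head pose can be sketched as a frame-level regression problem. The following is a minimal illustrative sketch, not the method of any cited work: it assumes hypothetical per-frame acoustic features (e.g. MFCC-like vectors) and head pose represented as three Euler angles per frame, and uses a ridge-regularised linear map on synthetic data as a stand-in for the neural sequence models used in the literature.

```python
# Minimal sketch of speech-driven head-pose prediction.
# All data here is synthetic and purely illustrative; real systems
# use recorded acoustic features and tracked head pose, and replace
# the linear map with a recurrent or convolutional network.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 frames of 13-dim acoustic features.
n_frames, n_feat = 500, 13
X = rng.standard_normal((n_frames, n_feat))

# Assumed ground-truth linear relation plus noise; targets are
# three Euler angles (pitch, yaw, roll) per frame.
W_true = rng.standard_normal((n_feat, 3))
Y = X @ W_true + 0.05 * rng.standard_normal((n_frames, 3))

# Ridge regression: W = (X^T X + lam * I)^{-1} X^T Y
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ Y)

# Predicted head pose and frame-level error.
Y_hat = X @ W
rmse = np.sqrt(np.mean((Y - Y_hat) ** 2))
print(f"training RMSE (radians): {rmse:.4f}")
```

In practice the per-frame mapping is one-to-many (the same audio can accompany many plausible head motions), which is why the literature favours sequence models with temporal context over such memoryless regressors.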
