Predicting head pose from speech

Speech animation, the process of animating a human-like model to give the impression it is talking, most commonly relies on the work of skilled animators or on performance capture. These approaches are time-consuming, expensive, and do not scale. This thesis develops algorithms for content-driven speech animation: models that learn visual actions from data without semantic labelling, to predict realistic speech animation from recorded audio. We achieve these goals by first forming a multi-modal corpus that represents the style of speech we want to model: speech that is natural, expressive and prosodic. This allows us to train deep recurrent neural networks to predict compelling animation. We first develop methods to predict the rigid head pose of a speaker. Predicting the head pose of a speaker from speech is not wholly deterministic, so our methods provide a large variety of plausible head pose trajectories from a single utterance. We then apply our methods to learn how to predict the head pose of the listener while in conversation, using only the voice of the speaker. Finally, we show how to predict the lip sync, facial expression, and rigid head pose of the speaker, simultaneously, solely from speech.
