Face reading from speech - predicting facial action units from audio cues

The automatic recognition of facial behaviours is usually achieved through the detection of specific FACS Action Units (AUs), which in turn makes it possible to analyse the affective behaviours expressed in the face. Although advanced techniques have been proposed to extract relevant facial descriptors, processing real-life data, i.e., recordings made in unconstrained environments, makes the automatic detection of FACS AUs much more challenging than for constrained recordings such as posed faces, and even impossible when the corresponding parts of the face are occluded or poorly illuminated. In this paper, we present the first attempt at using acoustic cues for the automatic detection of FACS AUs, as an alternative source of information when facial data are not available. Results show that features extracted from the voice can be used effectively to predict different types of FACS AUs, and that the best performance is obtained for the prediction of the apex, compared to the prediction of onset, offset and occurrence.
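
To make the prediction task concrete, the sketch below maps per-frame acoustic features to a binary AU label with a standard classifier. It is a minimal illustration rather than the authors' pipeline: the 88-dimensional feature vectors, the synthetic labels and the choice of a linear support vector machine are assumptions made for the sake of the example.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Stand-in data: one acoustic feature vector per speech frame (e.g. an
    # eGeMAPS-style 88-dimensional descriptor) and a binary label indicating
    # whether a given AU is active in the synchronised video frame.
    X = rng.normal(size=(500, 88))
    y = rng.integers(0, 2, size=500)

    # Linear SVM with feature standardisation, evaluated by cross-validated F1.
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print("Per-fold F1 for AU occurrence:", scores.round(3))

The same setup would apply to the other prediction targets mentioned above (onset, apex and offset), simply by retraining the classifier with the corresponding labels.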
