Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces

The prevalence of synthetic talking faces in both commercial and academic settings is increasing as the technology to generate them grows more powerful and more widely available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) can also improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that provided by real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated with FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen, the real face of the talker, or synthetic talking faces generated by either DNN or FACS. We replicated previous findings of a large benefit of seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features of these phonemes were weak or absent in the synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of just a few phonemes could substantially increase the overall perceptual benefit of synthetic faces.
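To make the final claim concrete, the sketch below illustrates the kind of phoneme-substitution analysis described in the last sentence: take per-phoneme recognition accuracies under the synthetic-face and real-face conditions, swap in the real-face values for a few poorly rendered phonemes, and recompute a frequency-weighted overall accuracy. This is a minimal illustration, not the paper's actual model; all accuracies, frequency weights, and function names are invented placeholders.

```python
# Minimal sketch (not the paper's actual model or data): estimate how much the
# overall benefit of a synthetic face could improve if the visual realism of a
# few phonemes matched the real face. All numbers below are illustrative only.

# Hypothetical per-phoneme accuracy (proportion correct) under each condition.
real_face = {"f": 0.80, "v": 0.78, "th": 0.75, "p": 0.85, "s": 0.70, "a": 0.90}
synthetic = {"f": 0.45, "v": 0.44, "th": 0.40, "p": 0.80, "s": 0.65, "a": 0.88}

# Hypothetical relative frequency of each phoneme in the test sentences.
freq = {"f": 0.05, "v": 0.04, "th": 0.06, "p": 0.10, "s": 0.25, "a": 0.50}

def weighted_accuracy(acc):
    """Frequency-weighted overall accuracy across phonemes."""
    return sum(freq[p] * acc[p] for p in freq)

def substitute(synthetic_acc, real_acc, phonemes):
    """Replace synthetic accuracies with real-face values for chosen phonemes."""
    patched = dict(synthetic_acc)
    for p in phonemes:
        patched[p] = real_acc[p]
    return patched

baseline = weighted_accuracy(synthetic)
improved = weighted_accuracy(substitute(synthetic, real_face, ["f", "v", "th"]))
print(f"synthetic face:               {baseline:.3f}")
print(f"with realistic /f/, /v/, /th/: {improved:.3f}")
print(f"real face:                    {weighted_accuracy(real_face):.3f}")
```

Under these toy numbers, matching real-face realism for only the three teeth-lip phonemes closes a noticeable fraction of the gap between the synthetic and real faces, which is the qualitative pattern the abstract describes.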
