Speech intelligibility of virtual humans

Abstract

The speech intelligibility benefit of visual speech cues during oral communication is well established. An ecologically valid approach to auditory assessment should therefore include the processing of both auditory and visual speech cues. This study describes the development and evaluation of virtual human speakers designed to present speech auditory-visually. A male and a female virtual human speaker were created and evaluated in two experiments: a visual-only speech-reading test with words and sentences, and an auditory-visual sentence intelligibility test. A group of five hearing, skilled speech-readers participated in the speech-reading test, whereas a group of young normal-hearing participants (N = 35) was recruited for the intelligibility test. The skilled speech-readers correctly identified 57 to 67% of the words and sentences uttered by the virtual speakers, and the presence of the virtual speaker improved the intelligibility of sentences in noise by 1.5 to 2 dB. These results demonstrate the potential of virtual humans for future auditory-visual speech assessment paradigms.
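
To make the reported outcome measure concrete, the short Python sketch below shows how a sentence-in-noise speech reception threshold (SRT) and the resulting audiovisual benefit in dB could be computed. It assumes a simple 1-up/1-down adaptive staircase with a 2 dB step and a mean-of-final-trials estimate; these choices, and every name in the code, are illustrative assumptions, not the procedure reported in the study.

# Hypothetical sketch of an adaptive sentence-in-noise test: the SNR is
# lowered after each correctly repeated sentence and raised after each
# error, so presentation levels converge on the listener's SRT.
# Step size, trial counts, and function names are assumptions.

def run_staircase(sentence_correct, n_trials=20, start_snr_db=0.0, step_db=2.0):
    """Run a 1-up/1-down staircase and return the list of presented SNRs.

    sentence_correct(snr_db) -> bool reports (or simulates) whether the
    listener repeated the sentence presented at that SNR correctly.
    """
    snr_db = start_snr_db
    presented = []
    for _ in range(n_trials):
        presented.append(snr_db)
        # Correct response: make the task harder (lower SNR); otherwise easier.
        snr_db += -step_db if sentence_correct(snr_db) else step_db
    return presented

def estimate_srt(presented_snrs, n_last=10):
    """Estimate the SRT as the mean SNR over the final n_last trials."""
    tail = presented_snrs[-n_last:]
    return sum(tail) / len(tail)

# The audiovisual benefit is the SRT difference between conditions:
#   benefit_db = srt_audio_only - srt_audiovisual
# A positive value (e.g. the 1.5 to 2 dB reported above) means sentences
# were understood at a lower SNR when the virtual speaker was visible.

Under this convention, a lower SRT indicates better performance, so the benefit of adding the virtual speaker appears as a positive dB difference between the auditory-only and auditory-visual conditions.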
