Audio-visual Evaluation and Detection of Word Prominence in a Human-Machine Interaction Scenario

This paper investigates the audio-visual correlates and the detection of word prominence. Subjects interacted with a computer in a small game that elicited a broad and a narrow focus condition. Audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, duration, intensity, fundamental frequency, and spectral emphasis were calculated. From the visual channel, head movements and image-transformation-based features from the mouth region were extracted. The results first show that the extracted features differ significantly between the two focus conditions (broad and narrow). Classification experiments then demonstrate that the two conditions can be distinguished, without knowledge of the word identity, with accuracies of approx. 80%. Furthermore, the visual channel alone yields accuracies notably above chance (approx. 65%), and combining both modalities increases performance to approx. 85%.
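To make the classification setup concrete, the following is a minimal sketch of per-word prominence classification with an SVM, evaluating the audio features, the visual features, and a simple feature-level fusion of both. It is not the authors' implementation: the feature names, dimensions, and the early-fusion scheme are illustrative assumptions, and scikit-learn's SVC (which wraps LIBSVM) stands in for the classifier.

```python
# Hedged sketch: prominence classification from per-word audio-visual features.
# Feature contents and fusion strategy are assumptions, not taken from the paper.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical per-word descriptors: e.g. [duration, mean intensity, F0 range,
# spectral emphasis] for audio, and head-movement / mouth-region statistics
# for video. Random placeholders stand in for real measurements here.
n_words = 200
audio_feats = rng.normal(size=(n_words, 4))    # duration, intensity, F0, spectral emphasis
visual_feats = rng.normal(size=(n_words, 6))   # head motion + mouth-region features
labels = rng.integers(0, 2, size=n_words)      # 0 = broad focus, 1 = narrow focus

def evaluate(features, labels):
    """Cross-validated accuracy of an RBF-SVM on one feature set."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return cross_val_score(clf, features, labels, cv=5).mean()

print("audio only  :", evaluate(audio_feats, labels))
print("visual only :", evaluate(visual_feats, labels))
# Simple early fusion: concatenate both modalities per word.
print("audio+visual:", evaluate(np.hstack([audio_feats, visual_feats]), labels))
```

With real features, comparing the three runs mirrors the reported pattern: visual-only above chance, audio-only stronger, and the fused representation strongest.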
