The effect of speaking rate on audio and visual speech

The speed at which an utterance is spoken affects both the duration of the speech and the positions of the articulators. Consequently, the sounds that are produced are modified, as are the position and appearance of the lips, teeth, tongue and other visible articulators. We describe an experiment designed to measure the effect of variable speaking rate on audio and visual speech by comparing the sequences of phonemes and dynamic visemes that appear in the same sentences spoken at different speeds. We find that both audio and visual speech production are affected by varying the rate of speech; however, the effect is significantly more prominent in visual speech.
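To make the comparison concrete, the sketch below aligns two hypothetical phoneme labellings of the same sentence spoken at a slow and a fast rate and reports a normalized edit (Levenshtein) distance between them. The choice of metric and the example labels are illustrative assumptions rather than details taken from the paper; the same routine would apply unchanged to dynamic viseme sequences.

```python
# Minimal sketch, assuming phoneme/viseme sequences are compared with a
# normalized Levenshtein (edit) distance. The metric and the example labels
# are assumptions for illustration, not the paper's reported method.

def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance between two label sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def normalized_distance(slow_seq, fast_seq):
    """Edit distance scaled by the longer sequence: 0 = identical, 1 = maximally different."""
    return edit_distance(slow_seq, fast_seq) / max(len(slow_seq), len(fast_seq))

# Hypothetical phoneme labellings of the same sentence at two speaking rates.
slow_phones = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
fast_phones = ["dh", "ax", "k", "ae", "s", "ae", "t"]  # one /t/ deleted at the fast rate
print(normalized_distance(slow_phones, fast_phones))    # 0.125
```

A larger normalized distance for the viseme sequences than for the phoneme sequences of the same sentence pairs would be one way to express the finding that speaking rate affects visual speech more strongly than audio speech.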
