A low-latency lip-synchronized videoconferencing system

In some videoconferencing systems, audio is presented ahead of video because audio takes less time to process. Delaying the audio to match the video would restore lip synchronization, but the resulting overall audio latency might be unacceptable. We built a videoconferencing system that achieves lip synchronization with minimal perceived audio latency. Instead of adding a fixed audio delay, our system time-stretches the audio at the beginning of each utterance until the audio is synchronized with the video. We conducted user studies and found that (1) audio could lead video by roughly 50 ms and still be perceived as synchronized; (2) audio could lead video by 300 ms and still be perceived as synchronized if the audio was time-stretched to synchronization within a short period; and (3) our algorithm appears to strike a favorable balance between minimizing audio latency and supporting lip synchronization.
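The catch-up idea can be illustrated with a little arithmetic. The sketch below is not the system's algorithm; it only assumes that audio starts playing with no added delay and is slowed by a constant factor `stretch` (output duration divided by source duration) until the accumulated delay equals the initial audio lead. The function names and the stretch factor of 1.5 are hypothetical and chosen for illustration; only the 300 ms lead comes from the abstract.

```python
def time_to_sync(lead_ms: float, stretch: float) -> tuple[float, float]:
    """Return (source_ms, playout_ms) needed to absorb an audio lead.

    Assumption (not from the paper): a constant stretch factor, where
    `stretch` = output duration / source duration and must exceed 1.
    """
    if lead_ms < 0:
        raise ValueError("lead_ms must be non-negative")
    if stretch <= 1.0:
        raise ValueError("stretch must be greater than 1 to add delay")
    # Each millisecond of source audio gains (stretch - 1) ms of delay.
    source_ms = lead_ms / (stretch - 1.0)
    # The stretched audio occupies stretch times as much playout time.
    playout_ms = source_ms * stretch
    return source_ms, playout_ms


def remaining_lead(lead_ms: float, stretch: float, playout_ms: float) -> float:
    """Audio lead still left after `playout_ms` of stretched playback."""
    absorbed = playout_ms * (stretch - 1.0) / stretch
    return max(0.0, lead_ms - absorbed)


if __name__ == "__main__":
    source_ms, playout_ms = time_to_sync(lead_ms=300.0, stretch=1.5)
    print(f"source audio consumed: {source_ms:.0f} ms")
    print(f"playout time elapsed:  {playout_ms:.0f} ms")
    print(f"lead left at 450 ms:   {remaining_lead(300.0, 1.5, 450.0):.0f} ms")
```

Under these assumed values, a 300 ms lead would be absorbed after 600 ms of speech, which takes 900 ms to play out. A larger stretch factor shortens the catch-up period but distorts the speech more, which is the kind of trade-off the user studies would need to evaluate.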
