Perceptually motivated guidelines for voice synchronization in film

We consume video content in a multitude of ways: in movie theaters, on television, on DVDs and Blu-rays, online, on smartphones, and on portable media players. For quality control, it is important to provide a uniform viewing experience across these platforms. In this work, we focus on voice synchronization, an aspect of video quality that is strongly affected by current post-production and transmission practices. We examined the synchronization of an actor's voice and lip movements in two distinct scenarios. First, we simulated the temporal mismatch between the audio and video tracks that can occur during dubbing or broadcast. Next, we recreated the pitch changes that result from conversions between formats with different frame rates. We show, for the first time, that these audio-visual mismatches affect viewer enjoyment. When the lack of temporal synchronization is noticeable, both the perceived performance quality and the perceived emotional intensity of a performance decrease. For pitch changes, we find that higher-pitched voices are disliked, especially for male actors. Based on these findings, we conclude that mismatched audio and video signals measurably degrade the viewing experience and should be corrected during post-production and transmission.
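As a concrete illustration of the frame-rate effect studied here: when 24 fps film is sped up for 25 fps PAL broadcast without pitch correction, the audio plays about 4.2% faster, raising the pitch by roughly 0.7 semitones. The short sketch below computes this shift; the function name and structure are ours, not from the study.

```python
import math

def pitch_shift_semitones(source_fps: float, target_fps: float) -> float:
    """Pitch shift (in semitones) when playback speed is scaled by
    target_fps / source_fps, as in uncorrected frame-rate conversion."""
    return 12.0 * math.log2(target_fps / source_fps)

# 24 fps film sped up for 25 fps PAL broadcast: a shift of ~+0.71 semitones,
# well within the range listeners can notice in a familiar voice.
shift = pitch_shift_semitones(24.0, 25.0)
```

The same formula gives a negative value for slow-downs (e.g. 25 fps material conformed to 24 fps), which lowers the perceived pitch instead.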
