Speaker-Following Video Subtitles

We propose a new method for improving the presentation of subtitles in video (e.g., TV and movies). With conventional subtitles, the viewer must constantly look away from the main viewing area to read the subtitles at the bottom of the screen, which disrupts the viewing experience and causes unnecessary eyestrain. Our method places on-screen subtitles next to their respective speakers, allowing the viewer to follow the visual content while reading the subtitles. We identify speakers using novel algorithms that combine audio and visual information, and then determine subtitle placement via global optimization. A comprehensive usability study indicated that our placement method outperformed both conventional fixed-position subtitling and a previous dynamic subtitling method, enhancing the overall viewing experience and reducing eyestrain.
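The core placement idea can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which uses global optimization over the whole video); it is a simplified per-frame heuristic, and all names and weights here (`face_box`, candidate offsets, the occlusion penalty) are illustrative assumptions: score a few candidate positions around the detected speaker's face, reject off-screen placements, and penalize covering other faces.

```python
# Simplified sketch (NOT the paper's method): score candidate subtitle
# positions around a speaker's face box. Boxes are (x, y, w, h) tuples;
# all parameter names and weights are illustrative assumptions.

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (x, y, w, h)."""
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def place_subtitle(face_box, text_size, frame_size, avoid_boxes=()):
    """Pick a position next to the face that stays on screen and
    occludes other faces the least, mildly preferring 'below'."""
    fx, fy, fw, fh = face_box
    tw, th = text_size
    W, H = frame_size
    # Candidate anchors: below, above, left, and right of the face.
    candidates = [
        (fx + fw // 2 - tw // 2, fy + fh + 10),   # below
        (fx + fw // 2 - tw // 2, fy - th - 10),   # above
        (fx - tw - 10, fy + fh // 2 - th // 2),   # left
        (fx + fw + 10, fy + fh // 2 - th // 2),   # right
    ]
    best, best_cost = None, float("inf")
    for i, (x, y) in enumerate(candidates):
        box = (x, y, tw, th)
        # Hard penalty for leaving the frame.
        cost = 1e9 if (x < 0 or y < 0 or x + tw > W or y + th > H) else 0.0
        # Penalize covering other faces; tie-break toward earlier candidates.
        cost += sum(overlap_area(box, b) for b in avoid_boxes) + i
        if cost < best_cost:
            best, best_cost = box, cost
    return best
```

The paper's global optimization would additionally enforce temporal stability (subtitles should not jump between frames) and readability constraints across the whole clip, which a per-frame greedy choice like this cannot capture.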
