Dynamic captioning: video accessibility enhancement for hearing impairment

More than 66 million people suffer from hearing impairment, and this disability makes video content difficult to understand because the audio information is lost. When scripts are available, captioning technology can help to a certain degree by synchronously displaying the scripts as a video plays. However, we show that existing captioning techniques are far from satisfactory in helping hearing-impaired audiences enjoy videos. In this paper, we introduce a video accessibility enhancement scheme based on a Dynamic Captioning approach, which draws on a rich set of technologies including face detection and recognition, visual saliency analysis, and text-speech alignment. Different from existing methods, which we categorize here as static captioning, dynamic captioning places scripts at suitable positions to help hearing-impaired audiences better identify the speaking characters. In addition, it progressively highlights the scripts word by word by aligning them with the speech signal, and it illustrates the variation of voice volume. In this way, these viewers can better track the scripts and perceive the moods conveyed by changes in volume. We implemented the technology on 20 video clips and conducted an in-depth study with 60 hearing-impaired users; the results demonstrate the effectiveness and usefulness of the video accessibility enhancement scheme.
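To make the volume-variation idea concrete, the sketch below shows one plausible way to derive a per-word loudness value once text-speech alignment has produced word boundaries: compute the RMS energy of each word's audio segment. This is an illustrative assumption, not the paper's actual implementation; the function name `word_volumes` and the synthetic two-"word" signal are ours.

```python
import numpy as np

def word_volumes(signal, sample_rate, word_times):
    """Return an RMS loudness value per aligned word.

    signal      -- mono speech samples as a 1-D float array
    sample_rate -- samples per second
    word_times  -- list of (start_sec, end_sec) pairs, e.g. from a
                   forced aligner (hypothetical input format)
    """
    volumes = []
    for start, end in word_times:
        seg = signal[int(start * sample_rate):int(end * sample_rate)]
        # RMS of the word's segment; 0.0 for an empty (degenerate) span.
        volumes.append(float(np.sqrt(np.mean(seg ** 2))) if seg.size else 0.0)
    return volumes

# Synthetic check: a loud "word" (amplitude 0.8) then a quiet one (0.2).
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
signal = np.where(t < 0.5, 0.8, 0.2) * np.sin(2 * np.pi * 220 * t)
vols = word_volumes(signal, sr, [(0.0, 0.5), (0.5, 1.0)])
print(vols)
```

A captioning renderer could then map each word's RMS value to a font size or weight as the word is highlighted, which is one simple way to surface the mood cues the paper attributes to volume variation.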
