Video accessibility enhancement for hearing-impaired users

More than 66 million people suffer from hearing impairment, a disability that makes video content difficult to understand because the audio information is lost. When scripts are available, captioning technology can help to a certain degree by displaying the scripts synchronously as the video plays. However, we show that existing captioning techniques are far from satisfactory in helping hearing-impaired audiences enjoy videos. In this article, we introduce a scheme to enhance video accessibility using a Dynamic Captioning approach, which draws on a rich set of technologies including face detection and recognition, visual saliency analysis, and text-speech alignment. Unlike existing methods, which can be categorized as static captioning, dynamic captioning places scripts at suitable positions to help hearing-impaired viewers better identify the speaking characters. In addition, it progressively highlights the scripts word by word by aligning them with the speech signal, and it illustrates the variation of voice volume. In this way, these viewers can better track the scripts and perceive the moods conveyed by changes in volume. We applied the technology to 20 video clips and conducted an in-depth study with 60 hearing-impaired users. The results demonstrate the effectiveness and usefulness of the proposed video accessibility enhancement scheme.
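To make the word-by-word highlighting idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a forced aligner (such as the SPHINX-based alignment cited below) has already produced per-word timestamps and a normalized volume estimate for each word, and it decides which words of a caption line are highlighted at a given playback instant, with visual emphasis scaled by the word's volume. All names (`AlignedWord`, `render_caption`, the weight range) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One script word with hypothetical forced-alignment output."""
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds
    rms: float    # mean voice volume of the word, normalized to [0, 1]


def render_caption(words, t):
    """Return (word, highlighted, weight) triples for playback time t.

    A word becomes highlighted once its start time has passed, so the
    highlight sweeps across the line in sync with the speech. The font
    weight (here an illustrative value in [400, 900]) grows with the
    word's volume, so louder speech appears visually stronger.
    """
    out = []
    for w in words:
        highlighted = t >= w.start
        weight = 400 + int(500 * w.rms) if highlighted else 400
        out.append((w.text, highlighted, weight))
    return out


# Example caption line with made-up alignment data.
line = [
    AlignedWord("I", 0.0, 0.1, 0.2),
    AlignedWord("can't", 0.1, 0.5, 0.9),
    AlignedWord("believe", 0.5, 0.9, 0.6),
    AlignedWord("it", 0.9, 1.1, 0.3),
]
```

At playback time 0.6 s, the first three words are highlighted and "can't", the loudest word, receives the strongest emphasis; a renderer would additionally anchor the whole line near the detected speaker's face rather than at a fixed screen position.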
