CineAD: a system for automated audio description script generation for the visually impaired

Audio description (AD) is an assistive technology that allows visually impaired people to access cinema and follow the story of a movie: the visual content of the story is narrated aloud during the film's gaps in dialogue. Nonetheless, this assistive technology is not widely used, in part because of the high cost and time involved in producing audio descriptions. To address this problem, this work proposes CineAD, a solution that automatically generates AD scripts for recorded audiovisual content. CineAD detects the pauses between spoken lines in the video and generates descriptions for those pauses from the original screenplay and subtitles. The generated script can then be fed to a speech synthesizer or read by an audio description narrator to produce the audio containing the descriptions. To evaluate the proposed solution, qualitative tests were conducted with visually impaired users and audio description narrators. The results show that CineAD can generate descriptions of the most important events in a video and can therefore help to reduce the barriers visually impaired people face in accessing video, provided that the screenplay and subtitles are available.
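
To make the first step of such a pipeline concrete, the sketch below shows one plausible way to locate dialogue gaps long enough to hold a narrated description, using the pydub library for silence detection. The helper name find_ad_slots, the file name, and the threshold values are illustrative assumptions, not the implementation described in the paper.

```python
# A minimal sketch of the gap-detection step, assuming pydub is installed.
# Thresholds and names are hypothetical; CineAD's actual method may differ.
from pydub import AudioSegment
from pydub.silence import detect_silence

def find_ad_slots(audio_path, min_gap_ms=2000, silence_thresh_db=-40):
    """Return (start_ms, end_ms) intervals of silence long enough
    to hold a short spoken description."""
    audio = AudioSegment.from_file(audio_path)
    # detect_silence returns [start_ms, end_ms] pairs quieter than the threshold
    gaps = detect_silence(audio,
                          min_silence_len=min_gap_ms,
                          silence_thresh=silence_thresh_db)
    return [(start, end) for start, end in gaps]

# Each slot could then be matched against time-coded subtitles, so that
# screenplay passages falling between two subtitle cues become AD candidates.
slots = find_ad_slots("movie.wav")
for start, end in slots:
    print(f"candidate AD slot: {start / 1000:.1f}s - {end / 1000:.1f}s")
```

In a design of this kind, the detected slots would be cross-referenced with the subtitle timestamps, so that only intervals free of dialogue are offered to the script-generation stage.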
