Deep-Sync: A novel deep learning-based tool for semantic-aware subtitling synchronisation

Subtitles are a key element in making media content accessible to people with hearing impairments and to elderly viewers, and they are also useful when watching TV in noisy environments or learning new languages. Most of the time, subtitles are generated manually in advance, producing a verbatim, synchronised transcription of the audio. In live TV broadcasts, however, captions are created in real time by a re-speaker with the help of voice recognition software, which inevitably leads to delays and a lack of synchronisation. In this paper, we present Deep-Sync, a tool for aligning subtitles with audio-visual content. The architecture integrates a deep language representation model with real-time voice recognition software to build a semantic-aware alignment tool that successfully aligns most subtitles even when there is no direct correspondence between the re-speaker's words and the audio content. To avoid any kind of censorship, Deep-Sync can be deployed directly on users' TVs: this introduces a small delay to perform the alignment but avoids delaying the signal at the broadcaster's station. Deep-Sync was compared with another subtitle alignment tool, showing that our proposal improves synchronisation in all tested cases.
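
The sketch below illustrates, under stated assumptions, how a semantic-aware alignment step of this kind could work: a Sentence-BERT-style multilingual encoder (here accessed through the sentence-transformers library) embeds each subtitle and each timestamped ASR hypothesis, and the subtitle is snapped to the most semantically similar segment. The function and variable names (`align_subtitle`, `asr_segments`) are hypothetical and are not taken from Deep-Sync itself.

```python
# Minimal sketch of semantic-aware subtitle alignment. Assumes a
# Sentence-BERT-style encoder and a list of timestamped ASR
# hypotheses; this is illustrative, not Deep-Sync's implementation.
from sentence_transformers import SentenceTransformer, util

# A multilingual sentence encoder; any comparable model would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def align_subtitle(subtitle_text, asr_segments, threshold=0.5):
    """Return the timestamp of the ASR segment most similar in meaning
    to the subtitle, or None if no segment is similar enough.

    asr_segments: list of (timestamp_seconds, transcript_text) tuples
    produced by the live speech recogniser.
    """
    sub_emb = model.encode(subtitle_text, convert_to_tensor=True)
    seg_embs = model.encode([text for _, text in asr_segments],
                            convert_to_tensor=True)

    # Cosine similarity between the subtitle and every ASR segment.
    scores = util.cos_sim(sub_emb, seg_embs)[0]
    best = int(scores.argmax())

    # Matching on meaning rather than exact wording lets the subtitle
    # align even when the re-speaker paraphrases the original audio;
    # weak matches are rejected rather than mis-aligned.
    if float(scores[best]) < threshold:
        return None
    return asr_segments[best][0]

# Example: the subtitle paraphrases the speech rather than quoting it.
segments = [(12.4, "the storm will reach the coast tonight"),
            (15.9, "authorities urge residents to stay indoors")]
print(align_subtitle("People are advised to remain at home", segments))
```

Matching in embedding space rather than by string overlap is what would allow alignment "even when there is no direct correspondence between the re-speaker's words and the audio content"; the similarity threshold is a hypothetical knob for trading missed alignments against false ones.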
