Joint Subtitle Extraction and Frame Inpainting for Videos with Burned-In Subtitles

Subtitles are crucial for understanding video content. However, a large number of videos carry only burned-in (hardcoded) subtitles, which prevents re-editing, translation, and similar downstream processing. In this paper, we construct a deep-learning-based system that performs the inverse conversion, from a video with burned-in subtitles to a subtitle file and an inpainted video, by coupling three deep neural networks (CTPN, CRNN, and EdgeConnect). In our evaluation, the proposed method separated subtitles from video frames with high precision and significantly improved video inpainting results compared with existing methods. This research fills a gap in the application of deep learning to the reconstruction of videos with burned-in subtitles, and is expected to be widely applicable to the reconstruction and re-editing of videos with subtitles, advertisements, logos, and other occlusions.
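To make the "subtitle file" output concrete, the following is a minimal sketch of how per-frame recognition results might be merged into timed subtitle cues. It is not the authors' implementation: the abstract does not state the subtitle file format or the merging rule, so the SubRip (SRT) layout, the exact-match grouping of consecutive frames, and the names `format_timestamp` and `frames_to_srt` are all assumptions made for illustration. The input is assumed to be one recognized string per frame, with an empty string where no subtitle was detected.

```python
from itertools import groupby


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def frames_to_srt(frame_texts, fps):
    """Merge runs of identical per-frame text into numbered SRT cues.

    frame_texts: one recognized string per frame ("" means no subtitle).
    fps: frame rate used to convert frame indices to seconds.
    Assumption: consecutive frames showing the same subtitle yield exactly
    the same string; a real recognizer would likely need fuzzy matching.
    """
    cues = []
    frame_index = 0
    for text, run in groupby(frame_texts):
        run_length = len(list(run))
        if text:  # skip runs of frames without a subtitle
            start = frame_index / fps
            end = (frame_index + run_length) / fps
            cues.append((start, end, text))
        frame_index += run_length
    blocks = [
        f"{i}\n{format_timestamp(s)} --> {format_timestamp(e)}\n{t}\n"
        for i, (s, e, t) in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical recognizer output for 8 frames at 2 frames per second.
    demo = ["", "", "Hello", "Hello", "Hello", "", "Bye", "Bye"]
    print(frames_to_srt(demo, fps=2))
```

Grouping on exact string equality keeps the sketch short. In practice, recognition noise between frames would probably call for a tolerance, such as an edit-distance threshold, before two frames are treated as the same cue.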
