U-Net Based Architectures for Document Text Detection and Binarization

With the increasing popularity of document analysis and recognition systems, text detection (TD) and text binarization (TB) in document images remain challenging tasks. In the paper, we introduced a two-step architecture for the TD task. Firstly, a U-net based model is used to get a text mask in terms of word-level bounding boxes. Secondly, we approximate the mask of the bounding boxes with rectangles using a classic computer vision method. The model achieves state-of-the-art result on document images and outperforms other popular approaches. Moreover, we introduce the Hybrid U-net architecture, which helps to solve the TB and TD problems at the same time. The model demonstrates high results on both problems. The shared convolution encoder allows to reduce the number of parameters and consumed memory compared to separate models without reducing the model performance.

[1]  Kaizhu Huang,et al.  Robust Text Detection in Natural Scene Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3]  Shuchang Zhou,et al.  EAST: An Efficient and Accurate Scene Text Detector , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Sergio Guadarrama,et al.  Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Yonatan Wexler,et al.  Detecting text in natural scenes with stroke width transform , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Weilin Huang,et al.  Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees , 2014, ECCV.

[7]  Konstantinos G. Derpanis,et al.  Evaluation of deep convolutional nets for document image classification and retrieval , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[8]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[9]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[10]  Luc Van Gool,et al.  Efficient Non-Maximum Suppression , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[11]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[12]  Jean-Philippe Thiran,et al.  FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents , 2019, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

[13]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Ray Smith An Overview of the Tesseract OCR Engine , 2007 .

[15]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Shijian Lu,et al.  Text Flow: A Unified Text Detection System in Natural Scene Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Weilin Huang,et al.  Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Pan He,et al.  Detecting Text in Natural Image with Connectionist Text Proposal Network , 2016, ECCV.

[19]  H. Heijmans Morphological image operators , 1994 .

[20]  Jiri Matas,et al.  FASText: Efficient Unconstrained Scene Text Detector , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Sébastien Ourselin,et al.  Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations , 2017, DLMIA/ML-CDS@MICCAI.

[22]  Herbert Freeman,et al.  Determining the minimum-area encasing rectangle for an arbitrary closed curve , 1975, CACM.

[23]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.