论文信息 - Chargrid: Towards Understanding 2D Documents

Chargrid: Towards Understanding 2D Documents

We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.

[1] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[2] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[3] Jian Sun,et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4] Ole Winther,et al. CloudScan - A Configuration-Free Invoice Analysis System Using Recurrent Neural Networks , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[5] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[6] Thomas Brox,et al. U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[7] Jason Weston,et al. A Neural Attention Model for Abstractive Sentence Summarization , 2015, EMNLP.

[8] Eugenio Culurciello,et al. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation , 2016, ArXiv.

[9] Marcus Liwicki,et al. Page segmentation of historical document images with convolutional autoencoders , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[10] Jonathan Tompson,et al. Efficient object localization using Convolutional Networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Ronan Collobert,et al. Learning to Refine Object Segments , 2016, ECCV.

[12] Tara N. Sainath,et al. Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[14] T. Munich,et al. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[15] Guillaume Lample,et al. Neural Architectures for Named Entity Recognition , 2016, NAACL.

[16] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17] Vladlen Koltun,et al. Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[18] Yoon Kim,et al. Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[19] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20] Yaroslav Bulatov,et al. Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks , 2013, ICLR.

[21] Ersin Yumer,et al. Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Faisal Shafait,et al. Real-Time Document Localization in Natural Images by Recursive Application of a CNN , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[23] Jitendra Malik,et al. Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[24] Chris Tensmeyer,et al. Document Image Binarization with Fully Convolutional Neural Networks , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[25] Oriol Vinyals,et al. Multilingual Language Processing From Bytes , 2015, NAACL.

[26] Roberto Cipolla,et al. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] Kaiming He,et al. Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[28] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.