Paper Information - Self-Supervised Representation Learning on Document Images

This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives that improve performance on the Tobacco-3482 image classification task. We also propose a novel self-supervision method that exploits the inherent multi-modality of documents (image and text). In document image classification scenarios with a limited amount of data, it performs better than other popular self-supervised methods and than supervised ImageNet pre-training.
