Self-Supervised Representation Learning on Document Images

This work analyses the impact of self-supervised pre-training on document images in the context of document image classification. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task. We also propose a novel method for self-supervision, which makes use of the inherent multi-modality of documents (image and text), which performs better than other popular self-supervised methods, including supervised ImageNet pre-training, on document image classification scenarios with a limited amount of data.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Marcus Liwicki,et al.  A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[3]  Christian Thiel,et al.  Classification on Soft Labels Is Robust against Label Noise , 2008, KES.

[4]  Nikos Komodakis,et al.  Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[5]  R. Smith,et al.  An Overview of the Tesseract OCR Engine , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[6]  Yi Li,et al.  Convolutional Neural Networks for Document Image Classification , 2014, 2014 22nd International Conference on Pattern Recognition.

[7]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[9]  Georgios C. Anagnostopoulos,et al.  Knowledge-Based Intelligent Information and Engineering Systems , 2003, Lecture Notes in Computer Science.

[10]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Deborah Silver,et al.  Feature Visualization , 1994, Scientific Visualization.

[12]  Ming Zhou,et al.  TableBank: A Benchmark Dataset for Table Detection and Recognition , 2019 .

[13]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[14]  Kan Chen,et al.  Billion-scale semi-supervised learning for image classification , 2019, ArXiv.

[15]  Konstantinos G. Derpanis,et al.  Evaluation of deep convolutional nets for document image classification and retrieval , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[16]  Alexander Kolesnikov,et al.  Revisiting Self-Supervised Visual Representation Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  C. V. Jawahar,et al.  Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Ersin Yumer,et al.  Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[21]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[22]  Nicolas Audebert,et al.  Multimodal deep networks for text and image-based document classification , 2019, PKDD/ECML Workshops.

[23]  Alexei A. Efros,et al.  What makes ImageNet good for transfer learning? , 2016, ArXiv.

[24]  Anoop Cherian,et al.  DeepPermNet: Visual Permutation Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jayant Kumar,et al.  Structural similarity for document image classification and retrieval , 2014, Pattern Recognit. Lett..

[26]  Zhoujun Li,et al.  TableBank: Table Benchmark for Image-based Table Detection and Recognition , 2019, LREC.

[27]  Alexei A. Efros,et al.  Colorful Image Colorization , 2016, ECCV.

[28]  Paolo Favaro,et al.  Boosting Self-Supervised Learning via Knowledge Transfer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Marcus Liwicki,et al.  Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[30]  Frédéric Kaplan,et al.  dhSegment: A Generic Deep-Learning Approach for Document Segmentation , 2018, 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR).

[31]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[32]  Julien Mairal,et al.  Leveraging Large-Scale Uncurated Data for Unsupervised Pre-training of Visual Features , 2019, ArXiv.

[33]  Yingli Tian,et al.  Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Andrew Zisserman,et al.  Multi-task Self-Supervised Visual Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Chris Tensmeyer,et al.  Analysis of Convolutional Neural Networks for Document Image Classification , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[36]  Hui Xiong,et al.  A Comprehensive Survey on Transfer Learning , 2021, Proceedings of the IEEE.

[37]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[38]  Marcus Liwicki,et al.  Real-Time Document Image Classification Using Deep CNN and Extreme Learning Machines , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[39]  Quoc V. Le,et al.  Do Better ImageNet Models Transfer Better? , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Martin Holecek,et al.  Line-items and table understanding in structured documents , 2019, ArXiv.

[41]  Gabriela Csurka,et al.  What is the right way to represent document images? , 2016, ArXiv.

[42]  Sergey Levine,et al.  Unsupervised Learning via Meta-Learning , 2018, ICLR.

[43]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.

[44]  Chao Yang,et al.  A Survey on Deep Transfer Learning , 2018, ICANN.

[45]  C. V. Jawahar,et al.  TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces , 2018, ArXiv.

[46]  Xiaojing Liu,et al.  Graph Convolution for Multimodal Information Extraction from Visually Rich Documents , 2019, NAACL.

[47]  Faisal Shafait,et al.  Rethinking Table Parsing using Graph Neural Networks , 2019, ArXiv.

[48]  Thomas Brox,et al.  Discriminative Unsupervised Feature Learning with Convolutional Neural Networks , 2014, NIPS.

[49]  Armand Joulin,et al.  Unsupervised Learning by Predicting Noise , 2017, ICML.

[50]  Steffen Bickel,et al.  Chargrid: Towards Understanding 2D Documents , 2018, EMNLP.

[51]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.