Tab this folder of documents: page stream segmentation of business documents

In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important for correct indexing, archiving, and processing. In many organizations, documents of different types are scanned together into folders, so it is essential to automate the segmentation of each folder into individual documents, which can then undergo further analysis tailored to their document type. This task is known as Page Stream Segmentation (PSS). In this paper, we propose a deep learning approach that, given a sequence of scanned pages (a folder) as input, determines whether each page is a breaking point. We also provide a dataset, TABME (TAB this folder of docuMEnts), generated specifically for this task. Our proposed architecture combines LayoutLM and ResNet to exploit both the textual and visual features of the document pages and achieves an F1 score of 0.953. The dataset and code used to run the experiments in this paper are available at https://github.com/aldolipani/TABME.
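To make the task formulation concrete, the following is a minimal sketch of how per-page binary decisions could be turned into documents and scored with F1. It is not the code from the TABME repository. The function names `split_folder` and `f1_score` are hypothetical, and the convention that a breaking point marks the first page of a new document is an assumption, since the abstract does not fix it.

```python
from typing import List, Sequence


def split_folder(pages: Sequence[str], is_break: Sequence[bool]) -> List[List[str]]:
    """Group a page stream (a folder) into documents.

    Assumption: is_break[i] is True when page i starts a new document.
    The first page always opens a document, whatever its label says.
    """
    if len(pages) != len(is_break):
        raise ValueError("pages and is_break must have the same length")
    documents: List[List[str]] = []
    for page, starts_new in zip(pages, is_break):
        if starts_new or not documents:
            documents.append([])  # open a new document
        documents[-1].append(page)
    return documents


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 of the positive (breaking-point) class for per-page labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no correct positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical folder of five pages holding three documents.
pages = ["a1", "a2", "b1", "c1", "c2"]
gold = [True, False, True, True, False]
pred = [True, False, False, True, True]

gold_docs = split_folder(pages, gold)
pred_docs = split_folder(pages, pred)
score = f1_score(gold, pred)
```

Under these assumptions, a classifier only has to emit one boolean per page; the grouping step is deterministic, so segmentation quality can be measured directly on the page-level labels.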
