Multi-modal page stream segmentation with convolutional neural networks

In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives as well as an important task in electronic mailroom applications. The workflow usually begins with batch scanning and optical character recognition (OCR) of documents. For multi-page documents, preserving the document context is a major requirement. To facilitate workflows involving very large amounts of paper scans, page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into coherent multi-page documents. In a digitization project together with a German federal archive, we developed a novel approach for PSS based on convolutional neural networks (CNN). As a first project of its kind, we combine visual information from scanned images with semantic information from OCR-ed texts for this task. Combining these multi-modal features in a single classification architecture allows for major improvements towards optimal document separation. In addition to multimodality, our PSS approach profits from transfer learning and sequential page modeling. We achieve an accuracy of up to 95% on multi-page documents on our in-house dataset and up to 93% on a publicly available dataset.
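
To make the task definition concrete, the following is a minimal sketch of how a page stream could be split once every page has a boundary decision. It assumes PSS is framed as a per-page binary decision ("this page starts a new document" or "this page continues the previous one"), which the abstract does not state explicitly. The function name `segment_stream`, the page identifiers, and the example decisions are hypothetical; in practice the decisions would come from a classifier such as the multi-modal CNN described above.

```python
from typing import List, Sequence


def segment_stream(page_ids: Sequence[str],
                   starts_new_doc: Sequence[bool]) -> List[List[str]]:
    """Group a page stream into documents from per-page boundary decisions.

    starts_new_doc[i] is True when page i is assumed to open a new document
    (hypothetical classifier output, not the authors' actual interface).
    """
    if len(page_ids) != len(starts_new_doc):
        raise ValueError("need exactly one boundary decision per page")
    documents: List[List[str]] = []
    for page, is_start in zip(page_ids, starts_new_doc):
        # The first page always opens a document, whatever the decision says.
        if is_start or not documents:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


pages = ["p1", "p2", "p3", "p4", "p5"]
decisions = [True, False, True, False, False]  # illustrative values only
print(segment_stream(pages, decisions))
```

Under this framing, a single wrong boundary decision either merges two documents or splits one, which is why per-page accuracy matters for the overall separation quality.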
