Beyond Document Page Classification: Design, Datasets, and Challenges

This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of the data tested ($X$: multi-channel, multi-page, multi-industry; $Y$: class distributions and label set variety) and in the classification tasks considered ($f$: multi-page document, page stream, and document bundle classification, ...). We identify the lack of public multi-page document classification datasets, formalize the different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations. An experimental study on the proposed multi-page document classification datasets demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate complete documents, as they naturally occur in practice. This reality check also calls for more mature evaluation methodologies, covering calibration evaluation, inference complexity (time-memory), and a range of realistic distribution shifts (e.g., born-digital vs. scanning noise, shifting page order). Our study ends on a hopeful note by recommending concrete avenues for future improvements.
