BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells

Tables are the most convenient way to represent structured information in a document, and understanding a table's structure is critical to understanding its contents. Several deep learning-based approaches in the literature have shown promising results in understanding table structures, but they require large amounts of annotated data. Annotated datasets for training these methods are expensive and laborious to produce, and very few are available. Moreover, human-annotated data suffers from inconsistencies in table and cell annotations. We propose BUDDI Table Factory (BTF), a toolbox for synthetically generating annotated documents with a wide range of variations in table structures. BTF uses a heuristics-based method to generate a variety of table structures, from which it renders synthetic documents using LaTeX. It then applies a computer vision-based approach to localize table and cell regions and automatically generates annotations in the PASCAL VOC challenge format. We show empirically that adding synthetic BTF documents to a limited set of original documents during model training can significantly improve the TEDS and IoU performance of table structure recognition on public and real-world healthcare datasets.
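To make the pipeline concrete, the following is a minimal Python sketch of three of the pieces the abstract names: emitting a table as LaTeX, writing table and cell boxes in a PASCAL VOC-style XML layout, and scoring a predicted box against a ground-truth box with IoU. It is not the BTF implementation. The function names (`random_table_latex`, `voc_annotation`, `iou`), the fully ruled table style, and the random integer cell contents are all assumptions made for illustration, and the computer vision step that localizes cells in the rendered page is not shown.

```python
import random
import xml.etree.ElementTree as ET


def random_table_latex(n_rows, n_cols, seed=0):
    """Return a fully ruled LaTeX tabular filled with random integers.

    Hypothetical stand-in for a structure generator; the real heuristics
    are not described in the abstract.
    """
    rng = random.Random(seed)
    col_spec = "|" + "c|" * n_cols
    lines = [f"\\begin{{tabular}}{{{col_spec}}}", "\\hline"]
    for _ in range(n_rows):
        cells = [str(rng.randint(0, 999)) for _ in range(n_cols)]
        lines.append(" & ".join(cells) + " \\\\ \\hline")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


def voc_annotation(filename, width, height, boxes):
    """Build a PASCAL VOC-style XML string.

    boxes is a list of (label, xmin, ymin, xmax, ymax) tuples in pixels.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0


if __name__ == "__main__":
    print(random_table_latex(2, 3, seed=1))
    # Box coordinates below are made-up values, not detector output.
    print(voc_annotation("doc.png", 100, 200,
                         [("table", 10, 20, 90, 180),
                          ("cell", 10, 20, 50, 60)]))
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

In a full pipeline the LaTeX string would be compiled and rasterized, and the pixel boxes passed to `voc_annotation` would come from the localization step rather than being written by hand.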
