WEATHERGOV+: A Table Recognition and Summarization Dataset to Bridge the Gap Between Document Image Analysis and Natural Language Generation

Tables, ubiquitous in data-oriented documents like scientific papers and financial statements, organize and convey relational information. Automatic table recognition from document images, which involves detecting the table within the page, segmenting its structure into rows, columns, and cells, and extracting information from the cells, has been a popular research topic in document image analysis (DIA). With recent advances in natural language generation (NLG) based on deep neural networks, data-to-text generation, in particular table summarization, offers interesting solutions to time-intensive data analysis. In this paper, we aim to bridge the gap between efforts in DIA and NLG regarding tabular data: we propose WEATHERGOV+, a dataset building upon WEATHERGOV, the standard dataset for tabular data summarization techniques, that allows for the training and testing of end-to-end methods that take document images as input and generate text summaries as output. WEATHERGOV+ contains images of tables created from the tabular data of WEATHERGOV using visual variations that cover various levels of difficulty, along with the corresponding human-generated table summaries of WEATHERGOV. We also propose an end-to-end pipeline that compares state-of-the-art table recognition methods for summarization purposes. We analyse the results of the proposed pipeline by evaluating on WEATHERGOV+ at each stage of the pipeline to identify the effects of error propagation and the weaknesses of current methods, such as OCR errors. With this research (dataset and code are made available), we hope to encourage new research on the processing and management of inter- and intra-document collections.
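The stage-wise evaluation described above can be illustrated with a minimal sketch. It is not the pipeline proposed in the paper: the function names (`cell_accuracy`, `unigram_f1`, `evaluate_pipeline`), the toy recognizer and summarizer, and the field names in the example table are all hypothetical, and the unigram F1 score is only a crude stand-in for metrics such as BLEU or ROUGE. The idea it shows is that scoring the summarizer once on the ground-truth table and once on the recognized table separates summarization errors from errors propagated by the recognition stage.

```python
from collections import Counter
from typing import Callable, Dict, List

Table = List[List[str]]  # a table as rows of cell strings


def cell_accuracy(predicted: Table, reference: Table) -> float:
    """Fraction of reference cells reproduced exactly at the same position."""
    total = sum(len(row) for row in reference)
    if total == 0:
        return 1.0
    correct = 0
    for r, ref_row in enumerate(reference):
        for c, ref_cell in enumerate(ref_row):
            in_bounds = r < len(predicted) and c < len(predicted[r])
            if in_bounds and predicted[r][c] == ref_cell:
                correct += 1
    return correct / total


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F1; a simplified stand-in for BLEU/ROUGE."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_pipeline(
    image: Table,
    gold_table: Table,
    gold_summary: str,
    recognize: Callable[[Table], Table],
    summarize: Callable[[Table], str],
) -> Dict[str, float]:
    """Score each stage, then the end-to-end output, against ground truth."""
    predicted_table = recognize(image)
    return {
        "recognition_cell_accuracy": cell_accuracy(predicted_table, gold_table),
        "summary_f1_from_gold_table": unigram_f1(summarize(gold_table), gold_summary),
        "summary_f1_end_to_end": unigram_f1(summarize(predicted_table), gold_summary),
    }


# Hypothetical stand-ins: the "image" is just the gold table, and the
# recognizer simulates an OCR confusion between digit 0 and letter O.
def toy_recognize(image: Table) -> Table:
    return [[cell.replace("0", "O") for cell in row] for row in image]


def toy_summarize(table: Table) -> str:
    values = {row[0]: row[1] for row in table[1:]}  # skip the header row
    return f"low around {values['temperature_min']} high near {values['temperature_max']}"


gold_table = [["field", "value"], ["temperature_min", "30"], ["temperature_max", "45"]]
gold_summary = "low around 30 high near 45"
scores = evaluate_pipeline(gold_table, gold_table, gold_summary, toy_recognize, toy_summarize)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy run the single misread cell lowers both the recognition score and the end-to-end summary score, while the summarizer fed with the ground-truth table is unaffected; the gap between the last two scores is the part attributable to error propagation.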
