ICDAR 2021 Competition on Scientific Literature Parsing

Scientific literature contains important information related to cutting-edge innovations in diverse domains. Advances in natural language processing have been driving rapid development in automated information extraction from scientific literature. However, scientific literature is often available in unstructured PDF format. While PDF is great for preserving basic visual elements, such as characters, lines, and shapes, on a canvas for presentation to humans, automatic processing of the PDF format by machines presents many challenges. With over 2.5 trillion PDF documents in existence, these issues are prevalent in many other important application domains as well. A critical challenge for automated information extraction from scientific literature is that documents often contain content that is not in natural language, such as figures and tables. Nevertheless, such content usually illustrates key results, messages, or summarizations of the research. To obtain a comprehensive understanding of scientific literature, an automated system must be able to recognize the layout of the documents and parse the non-natural-language content into a machine-readable format. Our ICDAR 2021 Scientific Literature Parsing Competition (ICDAR2021-SLP) aims to drive advances specifically in document understanding. ICDAR2021-SLP leverages the PubLayNet and PubTabNet datasets, which provide hundreds of thousands of training and evaluation examples. In Task A, Document Layout Recognition, the highest-performing submissions combine object detection with specialised solutions for the different categories. In Task B, Table Recognition, top submissions rely on methods to identify table components and on post-processing methods to generate the table structure and content. Results from both tasks show impressive performance and open the possibility of high-performance practical applications.
