PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation

Metadata extraction from scholarly PDF documents is fundamental to publishing, archiving, digital library construction, bibliometrics, and the analysis and evaluation of scientific competitiveness. However, scholarly PDF documents differ in layout and document elements, which makes it hard to compare extraction approaches fairly: testers draw on different sources of test documents, even when those documents come from the same journal or conference. Evaluating extraction approaches on a standard dataset therefore enables a fair and reproducible comparison. In this paper we present PARDA (Pdf Analysis and Recognition DAtaset), a dataset for the performance evaluation and analysis of scholarly document processing, with a focus on metadata extraction: title, authors, affiliations, author-affiliation-email matching, year, date, etc. The dataset covers computer science, physics, life science, management, mathematics, and the humanities, and spans publishers including ACM, IEEE, Springer, Elsevier, and arXiv; each document has a distinct layout and appearance in terms of metadata formatting. We also provide ground-truth metadata for the dataset in Dublin Core XML and BibTeX formats.
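To make the two ground-truth formats concrete, the sketch below shows what a single record could look like in Dublin Core XML and in BibTeX. This is a minimal illustration under assumed field choices (title, creators, publisher, date); the element set follows Dublin Core 1.1, but the exact schema, fields, and values used in PARDA's ground truth are not specified here and may differ.

    <?xml version="1.0" encoding="UTF-8"?>
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- Illustrative record only; PARDA's actual element set may differ. -->
      <dc:title>An Example Paper Title</dc:title>
      <dc:creator>Jane Doe</dc:creator>
      <dc:creator>John Smith</dc:creator>
      <dc:publisher>ACM</dc:publisher>
      <dc:date>2018</dc:date>
    </metadata>

    @inproceedings{doe2018example,
      title     = {An Example Paper Title},
      author    = {Doe, Jane and Smith, John},
      booktitle = {Proceedings of an Example Conference},
      publisher = {ACM},
      year      = {2018}
    }

Expressing the same bibliographic record in both formats keeps the two ground-truth files mutually consistent, which is convenient when an extraction tool emits one format and an evaluation script consumes the other.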
