Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods rely mainly on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved F1 scores of 81.3% to 96% across seven metadata fields. The data and source code are publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and in a GitHub repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf), respectively.
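As a rough illustration of what combining text-based and visual features for a CRF can look like, the sketch below builds one feature dictionary per OCR token. It is a minimal sketch under stated assumptions, not the feature set used in the paper: the `Token` class, the bounding-box fields, and every feature name (`rel_x`, `rel_y`, `rel_height`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Feature = Union[str, float, bool]


@dataclass
class Token:
    """One OCR token with its bounding box in page pixels (hypothetical layout)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def token_features(tokens: List[Token], i: int,
                   page_w: float, page_h: float) -> Dict[str, Feature]:
    """Build a feature dict for token i from text and layout cues."""
    tok = tokens[i]
    feats: Dict[str, Feature] = {
        # Text-based features (illustrative choices).
        "lower": tok.text.lower(),
        "is_title": tok.text.istitle(),
        "is_upper": tok.text.isupper(),
        "is_digit": tok.text.isdigit(),
        # Visual features: position and height normalized by page size.
        "rel_x": tok.x0 / page_w,
        "rel_y": tok.y0 / page_h,
        "rel_height": (tok.y1 - tok.y0) / page_h,
    }
    # Context from neighboring tokens, or sequence boundary markers.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].text.lower()
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].text.lower()
    else:
        feats["EOS"] = True
    return feats


def page_features(tokens: List[Token],
                  page_w: float, page_h: float) -> List[Dict[str, Feature]]:
    """Feature sequence for one cover page, one dict per token."""
    return [token_features(tokens, i, page_w, page_h)
            for i in range(len(tokens))]


# Example: two tokens on a hypothetical 600 x 800 pixel cover page.
page = [Token("Automatic", 100, 50, 300, 80), Token("2020", 250, 700, 300, 715)]
features = page_features(page, page_w=600, page_h=800)
```

A sequence of dictionaries like `features`, paired with one label per token, is the kind of input that dictionary-based CRF toolkits typically accept. The intuition behind the visual features is that a token near the top of the page in a tall font is more likely to belong to the title than a small token near the bottom.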

[1] Muntabir Hasan Choudhury, et al. A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations, 2020, JCDL.

[2] Chiman Kwan, et al. A Comparative Study of Sequence Tagging Methods for Domain Knowledge Entity Recognition in Biomedical Papers, 2020, JCDL.

[3] Kazem Taghva, et al. Aligning Ground Truth Text with OCR Degraded Text, 2019, Advances in Intelligent Systems and Computing.

[4] C. Lee Giles, et al. HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities, 2017, 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL).

[5] Martin Sosic, et al. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance, 2016, bioRxiv.

[6] Xiaoli Li, et al. Keyphrase Extraction using Sequential Labeling, 2016, ArXiv.

[7] Dominika Tkaczyk, et al. CERMINE: automatic extraction of structured metadata from scientific literature, 2015, International Journal on Document Analysis and Recognition (IJDAR).

[8] Jöran Beel, et al. Evaluation of header metadata extraction approaches and tools for scientific PDF documents, 2013, JCDL '13.

[9] Patrice Lopez, et al. GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications, 2009, ECDL.

[10] C. Lee Giles, et al. ParsCit: an Open-source CRF Reference String Parsing Package, 2008, LREC.

[11] Edward A. Fox, et al. Automatic document metadata extraction using support vector machines, 2003, 2003 Joint Conference on Digital Libraries, 2003. Proceedings.

[12] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[13] Kalina Bontcheva, et al. Developing Language Processing Components with GATE Version 5 (a User Guide), 2010.

[14] H. Cunningham, et al. Developing Language Processing Components with GATE, 2001.