Paper Information - A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work that uses a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions is then applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset composed of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files. It provides a strong baseline for machine-learning-based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
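The regular-expression step described in the abstract can be illustrated with a minimal Python sketch. It assumes the cover page has already been converted to plain text by an OCR tool. The patterns, the `PATTERNS` table, the `extract_metadata` function, and the sample cover page are all hypothetical and written only for illustration; the abstract does not give the authors' actual expressions, which are presumably more elaborate.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, one per metadata field. Each exposes the captured
# text through a named group called "value". These are NOT the paper's
# expressions; they only illustrate the general approach.
PATTERNS = {
    # Title: everything before a line that contains only "by".
    "title": re.compile(
        r"\A\s*(?P<value>.+?)\s*\n[ \t]*by[ \t]*\n", re.DOTALL | re.IGNORECASE
    ),
    # Author: the line that follows the "by" line.
    "author": re.compile(
        r"^[ \t]*by[ \t]*\n[ \t]*(?P<value>.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
    # Year: a four-digit number starting with 19 or 20.
    "year": re.compile(r"\b(?P<value>(?:19|20)\d{2})\b"),
    # Degree: a small, fixed list of degree names.
    "degree": re.compile(
        r"(?P<value>Doctor of Philosophy|Master of Science|Master of Arts)",
        re.IGNORECASE,
    ),
    # Program: the line that follows a line containing only "in".
    "program": re.compile(
        r"^[ \t]*in[ \t]*\n[ \t]*(?P<value>.+?)[ \t]*$", re.MULTILINE
    ),
    # Institution: any line that mentions "University".
    "institution": re.compile(
        r"^[ \t]*(?P<value>[^\n]*\bUniversity\b[^\n]*?)[ \t]*$", re.MULTILINE
    ),
    # Advisor: a line such as "Advisor: <name>".
    "advisor": re.compile(
        r"^[ \t]*(?:Advisor|Chair|Supervisor)[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_metadata(cover_text: str) -> Dict[str, Optional[str]]:
    """Apply each field's pattern to the OCR text of a cover page."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(cover_text)
        # Collapse internal whitespace so multi-line values become one line.
        result[field] = " ".join(match.group("value").split()) if match else None
    return result


# Invented cover-page text standing in for OCR output.
SAMPLE = """A Study of Widget Dynamics
by
Jane Q. Example
Thesis submitted to the Faculty of the
Example State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Mechanical Engineering
Advisor: John R. Sample
May 1987
"""

if __name__ == "__main__":
    for field, value in extract_metadata(SAMPLE).items():
        print(f"{field}: {value}")
```

Keeping one pattern per field makes each heuristic easy to inspect and adjust independently, which suits a baseline. Real OCR output is noisier than this sample, so patterns of this kind would need to tolerate misrecognized characters and varied cover-page layouts.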

Edward A. Fox | William A. Ingram | Jian Wu | Muntabir Hasan Choudhury
