We describe how paperXML, a logical document structure markup for scholarly articles, is generated from the output of OCR tools. PaperXML was initially developed for the ACL Anthology Searchbench. Its main purpose was to provide robust, uniform access to sentences in ACL Anthology papers from the past 46 years, ranging from scanned, typewritten conference and workshop proceedings papers to recent, high-quality typeset, born-digital journal articles with varying layouts. PaperXML markup includes information on page and paragraph breaks, section headings, footnotes, tables, captions, boldface and italic character styles, as well as bibliographic and publication metadata. In the ACL Contributed Task "Rediscovering 50 Years of Discoveries", paperXML serves as a fallback source (1) for older, scanned papers (mostly published before the year 2000) for which born-digital PDF sources are not available, (2) for born-digital PDF papers on which the PDFExtract method failed, and (3) for document parts for which PDFExtract does not yet output useful markup, as is currently the case for tables. We sketch the transformation of paperXML into the TEI P5 XML used in the ACL Contributed Task.