Combining OCR Outputs for Logical Document Structure Markup. Technical Background to the ACL 2012 Contributed Task

We describe how paperXML, a logical document structure markup for scholarly articles, is generated from the outputs of OCR tools. PaperXML was initially developed for the ACL Anthology Searchbench. Its main purpose was to provide robust, uniform access to sentences in ACL Anthology papers from the past 46 years, which range from scanned, typewritten conference and workshop proceedings papers to recent, high-quality typeset, born-digital journal articles, in varying layouts. PaperXML markup includes information on page and paragraph breaks, section headings, footnotes, tables, captions, boldface and italic character styles, as well as bibliographic and publication metadata. The role of paperXML in the ACL Contributed Task Rediscovering 50 Years of Discoveries is to serve as a fall-back source (1) for older, scanned papers (mostly published before the year 2000) for which born-digital PDF sources are not available, (2) for born-digital PDF papers on which the PDFExtract method failed, and (3) for document parts for which PDFExtract does not output useful markup, which is currently the case for tables. We sketch the transformation of paperXML into the ACL Contributed Task's TEI P5 XML.