Paper information - Towards High-Quality Text Stream Extraction from PDF. Technical Background to the ACL 2012 Contributed Task


Extracting textual content and document structure from PDF presents a surprisingly (depressingly, to some, in fact) difficult challenge, owing to the purely display-oriented design of the PDF document standard. While a variety of lower-level PDF extraction toolkits exist, none fully support the recovery of original text (in reading order) and relevant structural elements, even for so-called born-digital PDFs, i.e. those prepared electronically using typesetting systems like LaTeX, OpenOffice, and the like. This short paper summarizes a new tool for high-quality extraction of text and structure from PDFs, combining state-of-the-art PDF parsing, font interpretation, layout analysis, and TEI-compliant output of text and logical document markup.
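
To illustrate why reading order is hard to recover from a display-oriented format, the following is a minimal sketch and not the tool described in the paper. It assumes a hypothetical `Block` record holding a text fragment and its page position, a page with exactly two equal-width columns, and a US Letter page width of 612 points. The helper names `reading_order` and `to_tei_like` are invented for this example, and the XML produced only mimics the outer shape of a TEI document; it is not validated against the TEI schema.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET


@dataclass
class Block:
    """A positioned text fragment, as a low-level PDF toolkit might report it (hypothetical)."""
    text: str
    x: float  # left edge, in points
    y: float  # top edge, in points, measured downward from the top of the page


def reading_order(blocks, page_width):
    """Order blocks column by column, assuming exactly two equal-width columns."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b.x < mid), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= mid), key=lambda b: b.y)
    return left + right


def to_tei_like(blocks):
    """Wrap ordered blocks in a minimal TEI-style skeleton (not schema-validated)."""
    tei = ET.Element("TEI")
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for b in blocks:
        ET.SubElement(body, "p").text = b.text
    return ET.tostring(tei, encoding="unicode")


blocks = [
    Block("Left column, first paragraph.", x=50, y=100),
    Block("Right column, first paragraph.", x=320, y=100),
    Block("Left column, second paragraph.", x=50, y=300),
    Block("Right column, second paragraph.", x=320, y=300),
]

# Naive top-to-bottom, left-to-right sorting interleaves the two columns.
naive = [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]

# Column-aware ordering reads the left column fully before the right one.
ordered = reading_order(blocks, page_width=612)
xml = to_tei_like(ordered)
```

Here `naive` alternates between the two columns, while `ordered` keeps each column contiguous. Real documents add headers, footnotes, figures, and column counts that vary from page to page, which is where the layout analysis mentioned in the abstract comes in.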
