Towards Reverse Engineering of PDF Documents

We present a progress report on our ongoing project of reverse engineering scientific PDF documents. The aim is to obtain mathematical markup that can be used as source for regenerating a document that resembles the original as closely as possible. This source can then serve as a basis for further document processing. Our current tool combines specialised PDF extraction with image analysis to produce near-perfect input for parsing mathematical formulae. By applying a linear grammar and specific drivers for each output format to this input, we can produce an accurate reproduction of formulae when presented with their coordinates. In this paper we show how this information can be exploited to discover the locations of both inline and display formulae, and also to perform rudimentary layout analysis of the whole document, identifying structures such as headings and paragraphs.
