Cross-reference identification within a PDF document

Cross-references, such as footnotes, endnotes, figure and table captions, and bibliographic references, are a common and useful type of page element that further explains the corresponding entities in a document. In this paper, we focus on cross-reference identification in PDF documents and present a robust method, using footnotes and figure references as a case study. The proposed method first extracts footnotes and figure captions, and then matches them with their corresponding references within the document. A number of novel features of a PDF document, namely page layout, font information, and lexical and linguistic features of cross-references, are utilized for the task. Clustering is adopted to handle features that are stable within one document but vary across different kinds of documents, so that the identification process adapts to the document type. In addition, the method uses results from the matching process as feedback to the identification process, further improving accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying cross-references in PDF documents.
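
The extract-then-match idea can be illustrated with a minimal sketch. It is not the method of the paper: it assumes the PDF text has already been extracted into plain lines, it uses only simple lexical patterns (no layout, font, or clustering features), and the names `CAPTION_RE`, `MENTION_RE`, and `match_figure_references` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed lexical patterns; real documents vary widely in caption style.
# A caption is taken to be a line that starts with "Figure N." or "Fig. N:".
CAPTION_RE = re.compile(r"^(?:Figure|Fig\.)\s*(\d+)\s*[.:]\s*(.+)$", re.IGNORECASE)
# A reference is any in-text mention such as "Figure 3" or "Fig. 3".
MENTION_RE = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)\b", re.IGNORECASE)


def match_figure_references(lines):
    """Pair each figure caption with the indices of lines that mention it."""
    captions = {}
    mentions = defaultdict(list)
    for idx, line in enumerate(lines):
        text = line.strip()
        caption = CAPTION_RE.match(text)
        if caption:
            # Step 1: extraction. Record the caption text under its number.
            captions[int(caption.group(1))] = caption.group(2)
            continue
        # Step 2: collect candidate references from body text.
        for ref in MENTION_RE.finditer(text):
            mentions[int(ref.group(1))].append(idx)
    # Step 3: matching. Join captions and mentions on the figure number.
    return {
        number: {"caption": text, "mentioned_on": mentions.get(number, [])}
        for number, text in captions.items()
    }
```

A body sentence that happens to begin with "Figure 1." would be misread as a caption by this sketch, which is the kind of ambiguity that layout and font features are meant to resolve.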