Metadata Extraction from PDF Papers for Digital Library Ingest

This paper reports our recent research on document analysis techniques for extracting metadata from PDF papers. We describe a package designed to extract basic metadata from these documents. The package works together with a digital library software suite to simplify the construction of personal digital libraries. It combines several techniques: PDF parsing, low-level document image processing, and layout analysis. In addition, information gathered from a widely known citation database (DBLP) assists the tool in the difficult task of author identification. The system is tested on paper collections selected from recent conference proceedings.
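
The idea of using a citation database to support author identification can be illustrated with a minimal sketch. It is not the package's actual method: it assumes the text lines of the first page are already available, and it replaces DBLP with a small in-memory set of known author names. The function names (normalize, split_candidates, identify_authors) and the sample names are hypothetical.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase a name, strip accents, digits and symbols, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep letters, spaces, dots and hyphens; affiliation markers such as
    # "1" or "*" attached to a name are replaced by spaces.
    cleaned = re.sub(r"[^A-Za-z\s.\-]", " ", no_accents)
    return " ".join(cleaned.lower().split())


def split_candidates(line):
    """Split a text line into candidate names on commas, ';', '&' and 'and'."""
    parts = re.split(r",|;|&|\band\b", line)
    return [p.strip() for p in parts if p.strip()]


def identify_authors(lines, known_authors):
    """Return known authors found in the lines, in order of appearance.

    known_authors is an assumed stand-in for names taken from a citation
    database; a real system would query or index the database instead.
    """
    index = {normalize(a): a for a in known_authors}
    found = []
    for line in lines:
        for candidate in split_candidates(line):
            match = index.get(normalize(candidate))
            if match is not None and match not in found:
                found.append(match)
    return found


if __name__ == "__main__":
    # Hypothetical first-page lines and a hypothetical author list.
    page_lines = [
        "A Sample Paper Title",
        "Maria Rossi1, John Q. Smith2 and Jose Garcia*",
        "Department of Examples, Some University",
    ]
    known = ["José García", "Maria Rossi", "John Q. Smith", "Wei Zhang"]
    print(identify_authors(page_lines, known))
```

Exact matching after normalization keeps the sketch short. It misses abbreviated or reordered names, so a practical matcher would likely need approximate comparison and layout cues to decide which lines to examine.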
