Structure extraction from PDF-based book documents

PDF documents have become a dominant knowledge repository in both academia and industry, largely because they are convenient to print and exchange. However, methods for automatically extracting their structural information remain underexplored, and the lack of effective methods hinders reuse of the information in PDF documents. To improve the usability of PDF-formatted electronic books, we propose a novel computational framework for analyzing their underlying physical and logical structure. The analysis is conducted at both the page level and the document level, and covers global typography, reading order, logical elements, chapter/section hierarchy, and metadata. Moreover, two characteristics of PDF-based books, namely style consistency across the whole book and the natural rendering order of PDF files, are exploited to improve on conventional image-based structure extraction methods. We use the bipartite graph as a common structure for modeling several tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on this graph representation, the optimal matching (OM) method is used to find the global optimum in each of those tasks. Extensive benchmarking on real-world data validates the high efficiency and discrimination ability of the proposed method.
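To make the bipartite formulation concrete, the following is a minimal sketch of figure and caption association as a minimum-cost matching. It is an illustration under stated assumptions, not the method of the paper: the function name `match_figures_to_captions`, the use of Euclidean distance between block centres as the edge weight, and the exhaustive search over assignments are all hypothetical choices made here, since the abstract does not specify the edge weights or the solver.

```python
from itertools import permutations
from math import hypot


def match_figures_to_captions(figures, captions):
    """Return (assignment, total_cost) for the cheapest one-to-one matching.

    figures, captions: dicts mapping an id to an (x, y) block centre.
    Assumption: the edge weight is the Euclidean distance between centres,
    and there are at least as many captions as figures.
    The search is exhaustive, so it only suits a handful of blocks per page.
    """
    fig_ids = list(figures)
    cap_ids = list(captions)
    best_cost, best_assignment = float("inf"), {}
    # Each permutation is one complete matching in the bipartite graph.
    for perm in permutations(cap_ids, len(fig_ids)):
        cost = sum(
            hypot(figures[f][0] - captions[c][0], figures[f][1] - captions[c][1])
            for f, c in zip(fig_ids, perm)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, dict(zip(fig_ids, perm))
    return best_assignment, best_cost


# Hypothetical page layout: fig1 is nearest to capA, but pairing them
# would leave fig2 with a distant caption.
figures = {"fig1": (0, 0), "fig2": (0, 3)}
captions = {"capA": (0, 1), "capB": (0, -2)}
assignment, cost = match_figures_to_captions(figures, captions)
print(assignment, cost)
```

In this layout a greedy nearest-neighbour rule would pair fig1 with capA (distance 1) and force fig2 onto capB (distance 5), for a total of 6. The global matching instead pairs fig1 with capB and fig2 with capA for a total of 4, which is the kind of global optimum the abstract refers to. For realistic page sizes, a polynomial-time assignment solver such as the Hungarian algorithm would replace the exhaustive search.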
