Comprehensive Global Typography Extraction System for Electronic Book Documents

Book documents usually have a consistent typography throughout the whole book, including headers, footers, columns, text line directions, and the fonts used at each heading level. Such document-level typography information is of great value for downstream document processing applications. This paper presents a document analysis system that extracts a comprehensive set of typographic features used in book documents. The system consists of several components: recognition of the fonts used in the body text and chapter headings; detection of the page body area, headers, and footers; and detection of columns, text line direction, and line spacing of the body text. Page association, which relates corresponding elements across pages, is employed in the system. Preliminary experimental results demonstrate the effectiveness of the system.
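To make the idea of page association concrete, the following is a minimal sketch of how running headers and footers might be found by comparing lines at the same position across pages. It is not the system described in this paper: the function names `normalize` and `find_running_lines`, the `min_ratio` threshold, and the representation of a page as a plain list of text lines are all assumptions made for illustration, and the sketch counts repetitions over the whole document rather than comparing neighboring pages.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Lowercase the line and replace digit runs with '#',
    so that 'Page 12' and 'Page 13' compare as equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, position=0, min_ratio=0.5):
    """Return normalized lines at `position` that recur on enough pages.

    pages:     list of pages, each a list of text lines, top to bottom
               (hypothetical input format).
    position:  0 for the first line (header candidate),
               -1 for the last line (footer candidate).
    min_ratio: assumed fraction of pages a line must appear on.
    """
    candidates = [normalize(p[position]) for p in pages if p]
    counts = Counter(candidates)
    threshold = min_ratio * len(pages)
    return {text for text, n in counts.items() if text and n >= threshold}


# Toy document: same header on every page, page numbers as footers.
pages = [
    ["Chapter 1 Typography", "Body text of page one.", "12"],
    ["Chapter 1 Typography", "More body text.", "13"],
    ["Chapter 1 Typography", "Even more body text.", "14"],
]
headers = find_running_lines(pages, position=0)
footers = find_running_lines(pages, position=-1)
```

On the toy input, the repeated chapter title is reported as a header candidate and the page numbers collapse to a single footer pattern, while the body lines, which differ from page to page, are not reported. A practical system would presumably also use geometric position and font information, which this text-only sketch ignores.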
