Conversion of PDF Books in ePub Format

In recent years, interest in e-book readers has grown significantly. Most devices support two main document formats: PDF and ePub. The PDF format is widely used to share documents because it can be read across platforms. However, it is not well suited to comfortable reading on small screens. In contrast, the ePub format is re-flowable and well suited to e-book readers. In this paper we describe a system for converting PDF books to the ePub format that aims to invert the text formatting applied during pagination. For this purpose, layout analysis techniques are used to identify the book's table of contents and its main functional regions, such as chapters, paragraphs, and notes.
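To make the idea of inverting pagination concrete, the following minimal Python sketch merges hard-wrapped page lines back into re-flowable paragraphs. It is an illustration only, not the system described in this paper: the function name `reflow` is hypothetical, and the sketch assumes that text lines have already been extracted from the PDF in reading order and that a blank line separates paragraphs.

```python
def reflow(lines):
    """Merge hard-wrapped lines into paragraphs (illustrative sketch).

    Assumptions (not taken from the paper): `lines` is a list of text
    lines in reading order, and an empty line marks a paragraph break.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        stripped = raw.strip()
        if not stripped:
            # A blank line closes the paragraph being built, if any.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and stripped[0].islower():
            # Treat a trailing hyphen followed by a lowercase letter as a
            # word split by pagination and join the two halves.
            current = current[:-1] + stripped
        elif current:
            # An ordinary line break inside a paragraph becomes a space.
            current += " " + stripped
        else:
            current = stripped
    if current:
        paragraphs.append(current)
    return paragraphs


page_lines = [
    "The PDF format is widely used to share docu-",
    "ments across platforms.",
    "",
    "The ePub format adapts to",
    "small screens.",
]
for paragraph in reflow(page_lines):
    print(paragraph)
```

The hyphen rule is deliberately naive: it would also join a genuinely hyphenated word such as "re-flowable" when the hyphen falls at a line end. A real converter would need a dictionary or layout cues to tell the two cases apart.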
