Challenges in generating bookmarks from TOC entries in e-books

ABSTRACT Extracting document structure from a digital e-book is difficult and remains an active area of research. Many e-books, however, already have a table of contents (TOC) at the beginning of the document, which may lead us to believe that adding bookmarks to an e-book based on its existing TOC would be trivial. In this paper, we highlight the challenges involved in automatically adding bookmarks to an existing e-book based on the TOC contained within the document. If the specific location of each TOC entry within the document can be identified reliably, the algorithms can be easily extended to identify document structures within e-books that have a TOC. We describe Booky, a tool we have built that tries to automatically add PDF bookmarks to existing PDF-based e-books that include a TOC as part of the document content. The tool addresses most of the challenges we have identified, while a few tricky scenarios remain open.
