Table of Contents Recognition and Extraction for Heterogeneous Book Documents

Existing work on book table of contents (TOC) recognition has almost all been conducted on small, application-dependent, and domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition a challenging problem for a large-scale collection of heterogeneous books. We observe that TOCs can be grouped into three basic styles, namely "flat", "ordered", and "divided", and that this grouping gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on the classification of these three TOC styles. Evaluation on a large collection of over 25,000 book documents from various domains demonstrates its effectiveness and efficiency.
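
The classify-then-dispatch idea can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract names the three styles without defining them, so the cues used here (blank lines separating blocks for "divided", leading section numbers for "ordered", everything else "flat"), the 0.5 threshold, and all function names are assumptions made only for illustration.

```python
import re

# Hypothetical cue for "ordered": the entry starts with a section number
# such as "1", "1.2", or "Chapter 3". The abstract does not define the styles.
NUMBERED = re.compile(r"^\s*(?:chapter\s+)?(\d+(?:\.\d+)*)[.)]?\s+\S", re.IGNORECASE)


def classify_toc_style(lines):
    """Return 'divided', 'ordered', or 'flat' for a list of TOC text lines."""
    entries = [ln for ln in lines if ln.strip()]
    if not entries:
        raise ValueError("empty TOC")
    # Assumed cue for "divided": blank lines split the TOC into several blocks.
    text = "\n".join(ln.strip() for ln in lines).strip()
    if len([b for b in re.split(r"\n{2,}", text) if b]) > 1:
        return "divided"
    numbered = sum(1 for ln in entries if NUMBERED.match(ln))
    # Assumed threshold: at least half of the entries carry a section number.
    return "ordered" if numbered / len(entries) >= 0.5 else "flat"


def parse_flat(lines):
    # Every entry sits at the same level.
    return [(1, ln.strip()) for ln in lines if ln.strip()]


def parse_ordered(lines):
    # Level is the depth of the dotted section number ("1.2" -> level 2).
    out = []
    for ln in lines:
        if not ln.strip():
            continue
        m = NUMBERED.match(ln)
        level = m.group(1).count(".") + 1 if m else 1
        out.append((level, ln.strip()))
    return out


def parse_divided(lines):
    # The first line of each block is a heading (level 1), the rest level 2.
    out, block_start = [], True
    for ln in lines:
        if not ln.strip():
            block_start = True
            continue
        out.append((1 if block_start else 2, ln.strip()))
        block_start = False
    return out


# Style-specific parsing rules, selected after classification.
PARSING_RULES = {"flat": parse_flat, "ordered": parse_ordered, "divided": parse_divided}


def parse_toc(lines):
    """Classify the TOC style, then apply the matching parsing rules."""
    style = classify_toc_style(lines)
    return style, PARSING_RULES[style](lines)
```

For instance, `parse_toc(["1 Introduction .... 1", "1.1 Scope .... 2"])` classifies the input as "ordered" and assigns levels 1 and 2 to the two entries. A real system would rely on layout features from the page image or PDF rather than on plain text lines alone.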
