Analysis of Book Documents' Table of Content Based on Clustering

Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of existing TOC recognition methods, we observe that book documents are multi-page documents with intrinsic local format consistency. Based on this observation, we introduce an automatic TOC analysis method built on clustering. The method first detects the decorative elements in the TOC pages. It then learns the layout model used in the TOC pages through clustering. Finally, it generates TOC entries and extracts their hierarchical structure under the guidance of that model. The method also takes broken lines into account. Experimental results show that the method achieves high accuracy and efficiency. In addition, it has been successfully applied in a commercial e-book production software package.
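The clustering step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which layout features are clustered or which algorithm is used. The sketch assumes that each TOC line carries a left indentation value and that lines at the same hierarchical level share a similar indentation, which is one simple reading of local format consistency. The names `TocLine`, `cluster_indents`, `assign_levels`, and the `tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TocLine:
    text: str
    indent: float  # left x-coordinate of the line (hypothetical units, e.g. points)


def cluster_indents(indents, tolerance=5.0):
    """Group 1-D indent values with a simple gap-based clustering.

    Values are sorted; a value joins the current cluster when it lies
    within `tolerance` of the previous value, otherwise it starts a new
    cluster. Returns the cluster means in ascending order.
    """
    clusters = []
    for value in sorted(indents):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [sum(c) / len(c) for c in clusters]


def assign_levels(lines, tolerance=5.0):
    """Assign each line the 1-based rank of its nearest indent cluster.

    Assumption: a smaller indent means a higher (outer) level.
    """
    centers = cluster_indents([line.indent for line in lines], tolerance)
    result = []
    for line in lines:
        nearest = min(range(len(centers)),
                      key=lambda i: abs(centers[i] - line.indent))
        result.append((nearest + 1, line.text))
    return result


# Made-up example lines, not data from the paper.
lines = [
    TocLine("Chapter 1 Introduction", 72.0),
    TocLine("1.1 Background", 90.5),
    TocLine("1.2 Motivation", 89.8),
    TocLine("Chapter 2 Method", 71.6),
    TocLine("2.1 Clustering", 90.1),
    TocLine("2.1.1 Features", 108.3),
]

for level, text in assign_levels(lines):
    print("  " * (level - 1) + text)
```

A real system would likely cluster on several features at once (font size, font style, numbering pattern) and would have to merge broken lines before assigning levels. Indentation alone is used here only to keep the example short.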
