A mixed approach to book splitting

In this paper, we present a hybrid approach to splitting a book document into individual chapters. We use multiple sources of information to obtain a reliable assessment of which pages are chapter title pages. These sources are produced by four methods: blank space detection, font analysis, header and footer association, and table of contents (TOC) analysis. A combination component then scores potential chapter title pages and selects the best candidates. The approach exploits several kinds of information, including page headers and footers, layout, and keywords. It works well even without TOC information, which is crucial to most previous work on this problem. Experiments show that the approach is robust and reliable.
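
The combination step can be illustrated with a minimal sketch. The abstract does not specify how the four sources are weighted or how candidates are selected, so everything below is an assumption made for illustration: the names `PageEvidence`, `combined_score`, and `select_title_pages`, the weights, the threshold, and the choice of a weighted average. Each detector is assumed to emit a score in [0, 1] per page, and a detector with no output for a page (for example, TOC analysis on a book without a TOC) is simply left out and the remaining weights are renormalized.

```python
from dataclasses import dataclass, field

# Hypothetical weights for the four evidence sources; the paper's abstract
# does not give actual values, so these are illustrative only.
WEIGHTS = {
    "blank_space": 0.3,
    "font": 0.3,
    "header_footer": 0.2,
    "toc": 0.2,
}


@dataclass
class PageEvidence:
    """Per-page detector outputs, each assumed to lie in [0, 1]."""
    page: int
    scores: dict = field(default_factory=dict)  # detector name -> score


def combined_score(ev: PageEvidence) -> float:
    """Weighted average over the detectors that produced a score.

    Missing detectors are skipped and the remaining weights are
    renormalized, so a book without a TOC can still be scored.
    """
    used = [name for name in WEIGHTS if name in ev.scores]
    total_weight = sum(WEIGHTS[name] for name in used)
    if total_weight == 0:
        return 0.0
    weighted = sum(WEIGHTS[name] * ev.scores[name] for name in used)
    return weighted / total_weight


def select_title_pages(evidence, threshold=0.6):
    """Return page numbers whose combined score reaches the (assumed) threshold."""
    return [ev.page for ev in evidence if combined_score(ev) >= threshold]


if __name__ == "__main__":
    pages = [
        # No TOC evidence available for pages 1 and 2.
        PageEvidence(1, {"blank_space": 1.0, "font": 0.9, "header_footer": 0.8}),
        PageEvidence(2, {"blank_space": 0.1, "font": 0.2, "header_footer": 0.0}),
        PageEvidence(3, {"blank_space": 0.8, "font": 0.7, "header_footer": 0.9, "toc": 1.0}),
    ]
    print(select_title_pages(pages))
```

A weighted average is only one plausible way to combine the sources; a trained classifier or a rule-based vote would fit the same description, and the abstract does not say which the authors used.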
