Structuring documents according to their table of contents

In this paper, we present a method for structuring a document according to the information present in its table of contents (ToC). Both the detection of the ToC and the identification of the parts of the document body it refers to rely on a series of generic properties that characterize any ToC, while its hierarchization is achieved using clustering techniques. We also report on the robustness and performance of the method before discussing it in light of related work.
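As a rough illustration of how clustering can turn a flat list of ToC entries into hierarchy levels, the sketch below groups entries by their left indentation. The abstract names neither the features nor the clustering algorithm, so everything here is an assumption made for the example: the indentation feature, the gap-based single-linkage grouping, the `gap` threshold, and the name `cluster_levels` are all hypothetical.

```python
from typing import List


def cluster_levels(indents: List[float], gap: float = 10.0) -> List[int]:
    """Assign a hierarchy level to each ToC entry from its left indentation.

    Assumption for illustration only: entries at the same level share a
    similar indent. Distinct indents are sorted and split into clusters
    wherever two consecutive values differ by more than `gap`; clusters are
    numbered from the leftmost (level 0) to the rightmost.
    """
    if not indents:
        return []
    ordered = sorted(set(indents))
    level_of = {ordered[0]: 0}
    level = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > gap:
            level += 1  # a large jump in indentation starts a new level
        level_of[cur] = level
    return [level_of[x] for x in indents]


# Hypothetical ToC entries as (text, left indent in points).
entries = [
    ("1 Introduction", 72.0),
    ("1.1 Motivation", 90.5),
    ("1.2 Outline", 91.0),
    ("2 Method", 72.5),
    ("2.1 Detection", 90.0),
    ("2.1.1 Links", 108.0),
]
levels = cluster_levels([indent for _, indent in entries])
for (text, _), lvl in zip(entries, levels):
    print("  " * lvl + text)
```

A real system would likely combine several layout and typographic cues (font size, numbering, boldness) rather than indentation alone; the single feature is used here only to keep the grouping step easy to follow.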
