Detection and analysis of table of contents based on content association

Abstract

As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods concentrate only on the TOC itself and do not take into account the other pages in the same document. They also often require manual coding, or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be applied directly to a wide range of documents without the need to build or learn models for individual documents. In addition, the associations of general text and of page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve TOC functional tagging. Applications of the proposed method in a large-scale digital library project are also discussed.
