Automated detection and segmentation of table of contents page from document images

Extracting structural information from the table of contents (TOC) to help build a digital document library first requires identifying and segmenting the TOC page. The objective of a digital document library is to provide a non-labour-intensive, cheap and flexible way of storing, representing and managing paper documents in electronic form, so as to facilitate indexing, viewing, printing and extracting the intended portions. Information extracted from the TOC pages is to be used in a document database for effective retrieval of the required pages. We present a fully automatic method for identifying and segmenting a TOC page in a scanned document.
