Part-of-speech tagging for table of contents recognition

A labeling approach to the automatic recognition of tables of contents (TOCs) is described. A prototype is used for the electronic consultation of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Labeling is based on part-of-speech tagging. Tagging is initiated by a primary labeling of text components using specific dictionaries. Significant tags are then grouped into title and author strings and reduced to canonical forms according to contextual rules. Non-labeled tokens are integrated into one field or the other, either by applying contextual correction rules or by using a structure model generated from well-detected articles. The prototype performs satisfactorily on different TOC layouts and character recognition qualities. Without manual intervention, a 95.41% rate of correct segmentation was obtained on 38 journals comprising 2703 articles, along with an 81.74% rate of correct field extraction.
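The labeling pipeline outlined above (dictionary-based primary labeling, integration of non-labeled tokens by context, grouping of tags into fields) can be illustrated with a minimal sketch. It is not the Calliope implementation: the toy dictionaries, the label names `TITLE`, `AUTHOR` and `PAGE`, the nearest-neighbour fill rule, and all function names are assumptions made for illustration only.

```python
from itertools import groupby

# Hypothetical toy dictionaries; the real dictionaries are not given in the source.
AUTHOR_WORDS = {"j.", "a.", "smith", "doe", "and"}
TITLE_WORDS = {"recognition", "of", "tables", "document", "analysis"}


def primary_label(token):
    """Primary labeling: look the token up in the dictionaries."""
    low = token.lower()
    if low.isdigit():
        return "PAGE"      # assumed: a bare number is a page number
    if low in AUTHOR_WORDS:
        return "AUTHOR"
    if low in TITLE_WORDS:
        return "TITLE"
    return None            # non-labeled token


def fill_unlabeled(labels):
    """Assumed contextual rule: a non-labeled token takes the label of the
    preceding token, or of the following one if nothing precedes it."""
    filled = list(labels)
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled


def extract_fields(line):
    """Group consecutive tokens sharing a label into (label, string) fields."""
    tokens = line.split()
    labels = fill_unlabeled([primary_label(t) for t in tokens])
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(tok for _, tok in group))
        for label, group in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    print(extract_fields("Automatic Recognition of Tables J. Smith 12"))
```

In this sketch the unknown word "Automatic" is absorbed into the title field because its nearest labeled neighbour is a title word. The method described above additionally reduces fields to canonical forms and falls back on a structure model learned from well-detected articles, neither of which is modeled here.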
