Implementation of a Journal's Table of Contents Separation System based on Contents Analysis

This paper considers a method for automatically indexing tables of contents, with the aim of reducing the effort needed to enter paper information and build an index. Existing document analysis methods cannot analyse the varied table-of-contents formats of journals efficiently, because these formats contain many exceptions. We analyse and describe a range of journal contents formats, whose features differ from those of general documents. The principal elements to be represented for each paper are its title, authors, and page number. These three elements are modeled according to the order in which they are arranged, and their features are extracted. A table-of-contents recognition system for journals is implemented based on the proposed modeling method. On 660 published papers from various journals, the system achieves an exact extraction ratio of 91.5% for the title, author, and page elements.
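To make the idea of modeling the three elements by their arrangement order more concrete, the following is a minimal sketch, not the system described in the paper. It assumes the contents text has already been recognized as plain text lines, and it uses two hypothetical arrangement patterns with an illustrative "/" separator between title and authors. The names `TocEntry`, `PATTERNS`, and `parse_entry`, as well as the sample entries, are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class TocEntry(NamedTuple):
    title: str
    authors: str
    page: int
    order: str  # label of the arrangement that matched


# Hypothetical arrangement models: each pattern encodes one ordering of
# title, authors, and page. The "/" separator and the dot leaders before
# the page number are assumptions, not formats taken from the paper.
PATTERNS = [
    (
        "title-author-page",
        re.compile(r"^(?P<title>.+?)\s*/\s*(?P<authors>.+?)[\s.]*(?P<page>\d+)$"),
    ),
    (
        "page-title-author",
        re.compile(r"^(?P<page>\d+)\s+(?P<title>.+?)\s*/\s*(?P<authors>.+)$"),
    ),
]


def parse_entry(line: str) -> Optional[TocEntry]:
    """Try each arrangement model in turn; return None if none fits."""
    text = line.strip()
    for order, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return TocEntry(
                title=match.group("title").strip(),
                authors=match.group("authors").strip(),
                page=int(match.group("page")),
                order=order,
            )
    return None


if __name__ == "__main__":
    samples = [
        "A Study of Layout Analysis / J. Kim, S. Lee ..... 15",
        "23 Block Segmentation Methods / A. Park",
        "Editorial",
    ]
    for sample in samples:
        print(parse_entry(sample))
```

A real journal collection would need many more arrangement models and feature checks than these two patterns, which is the kind of format variety the abstract points to.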
