Automatic document navigation for digital content remastering

This paper presents a novel method for automatically adding navigation capabilities to re-mastered electronic books. We first analyze the need for a generic and robust system that automatically constructs navigation links in re-mastered books. We then introduce the core algorithm, which builds the links by text matching. The method uses a tree-structured dictionary and a directed graph of the table of contents to conduct the text matching efficiently. Information fusion further increases the robustness of the algorithm. We discuss experimental results from the MIT Press digital library project and illustrate the key functional features of the system. We also investigate how the quality of the OCR engine affects the linking algorithm. Finally, we point out the analogy between this work and Web link mining.
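
To make the text-matching idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the table-of-contents titles and the OCR text of each page are already available as strings. The word-level trie stands in for the tree-structured dictionary, and a simple "entries appear in reading order" rule stands in for the directed graph of the table of contents. All names (`build_trie`, `find_candidates`, `link_in_order`, `skip_pages`) are hypothetical and chosen for illustration only.

```python
import re
from typing import Dict, List, Optional, Set


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TrieNode:
    """One node of a word-level trie over table-of-contents titles."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entry_ids: List[int] = []  # TOC entries whose title ends here


def build_trie(titles: List[str]) -> TrieNode:
    """Insert every TOC title into the trie, one word per edge."""
    root = TrieNode()
    for entry_id, title in enumerate(titles):
        node = root
        for word in normalize(title):
            node = node.children.setdefault(word, TrieNode())
        node.entry_ids.append(entry_id)
    return root


def find_candidates(root: TrieNode, pages: List[str], n_entries: int,
                    skip_pages: Set[int]) -> List[List[int]]:
    """For each TOC entry, list the pages whose text contains its title.

    Pages in skip_pages (for example the TOC page itself) are ignored.
    """
    candidates: List[List[int]] = [[] for _ in range(n_entries)]
    for page_no, text in enumerate(pages):
        if page_no in skip_pages:
            continue
        words = normalize(text)
        for start in range(len(words)):
            node = root
            # Walk the trie from this start position until the words diverge.
            for word in words[start:]:
                node = node.children.get(word)
                if node is None:
                    break
                for entry_id in node.entry_ids:
                    if page_no not in candidates[entry_id]:
                        candidates[entry_id].append(page_no)
    return candidates


def link_in_order(candidates: List[List[int]]) -> List[Optional[int]]:
    """Pick, for each entry, the earliest candidate page that is not
    before the page chosen for the previous entry (assumed ordering rule)."""
    links: List[Optional[int]] = []
    previous = 0
    for pages in candidates:
        choice = next((p for p in pages if p >= previous), None)
        links.append(choice)
        if choice is not None:
            previous = choice
    return links


titles = ["Introduction", "Text Matching", "Results"]
pages = [
    "Contents Introduction 1 Text Matching 5 Results 9",  # TOC page
    "Introduction This book describes a linking system",
    "more text about results later",
    "Text Matching We match titles against page text",
    "Results The results show the links",
]
trie = build_trie(titles)
candidates = find_candidates(trie, pages, len(titles), skip_pages={0})
links = link_in_order(candidates)
```

In this toy input the word "results" also occurs on page 2, before the "Text Matching" chapter begins, so the ordering rule is what steers the "Results" entry to page 4. A real system would also need to tolerate OCR errors in the titles, which this exact-match sketch does not attempt.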
