Logical structure analysis of book document images using contents information

Numerous studies have so far been carried out extensively for the analysis of document image structure, with particular emphasis placed on media conversion and layout analysis. For the conversion of a collection of books in a library into the form of hypertext documents, a logical structure extraction technology is indispensable, in addition to document layout analysis. The table of contents of a book generally involves very concise and faithful information to represent the logical structure of the entire book. That is to say, we can efficiently analyze the logical structure of a book by making full use of its contents pages. This paper proposes a new approach for document logical structure analysis to convert document images and contents information into an electronic document. First, the contents pages of a book are analyzed to acquire the overall document logical structure. Thereafter, we are able to use this information to acquire the logical structure of all the pages of the book by analyzing consecutive pages of a portion of the book. Test results demonstrate very high discrimination rates: up to 97.6% for the headline structure, 99.4% for the text structure, 97.8% for the page-number structure and almost 100% for the head-foot structure.