Knowledge-based document image analysis system

A knowledge-based document layout analysis system has been designed and implemented to convert paper documents into electronic form effectively. The system extracts characters, words, and text lines from a given text column. Its domain knowledge includes generic typographical conventions and features of the printed symbols in a given document, but excludes publication-specific layout features. A domain-specific knowledge-representation scheme, the character prototype, has been introduced to represent printed symbols. The character prototypes for each font were generated in three ways: from font definitions, interactively, and automatically from training data. The system was coded in C and Common Lisp. Experimental results showed that it correctly extracted an average of 96 percent of the text lines from digitized text columns written in English. Almost all word blocks were properly generated from the character boxes in the extracted text lines. The character prototype scheme accurately represented the English, French, and Japanese alphabets, Chinese characters, and Bengali words, and text lines were correctly extracted from documents written in these languages. The basic representation scheme for printed documents is the X-Y tree. The term X-Y tree describes a family of hierarchical data structures whose common property is that they represent a recursive decomposition of space into isothetic (axis-parallel) rectangles. Additional properties depend on the selected local segmentation method; these properties have been identified and lead to a classification of X-Y trees. Various utility algorithms have been developed for queries, insertion, deletion, and compression. A significant advantage of this structure is that it can hierarchically represent both the logical and the layout structure of a document without using additional pointers.