Algorithms to separate text from a mixed text/graphic document and generate a succinct description for this complex graphic

This paper describes an approach for separating text from a mixed text/graphic document and for describing the graphic as a set of overlapping meaningful shapes. The accuracy with which the document can be reconstructed from the description file is also reported. The work continues our previous study, which dealt mainly with engineering drawings composed of polygonal shapes; here the focus is on documents that combine text with components of arbitrary curved shape. Algorithms are designed to automate the generation of loops with minimum redundancy from the bitmap of the image, and to break interwoven complex loops into simpler, interpretable shapes made of curved segments. A succinct description file can then be established for the whole image, giving a drastic saving in memory when document images are archived. The effectiveness of the algorithms has been evaluated through experiments on a large number of mixed text/graphic documents, and the results show that the algorithms are computationally efficient. Once the text has been separated from the graphic and the graphic has been decomposed into its meaningful component parts, the data reduction achieved through this succinct description is extremely high. For silhouettes of curved shape, an approach called concatenated-arc representation is developed for their description. It needs far fewer arc segments than the number of line segments required by a line-segment approximation. Shapes reconstructed from these description files closely match the originals, even for very complex graphics.
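
As a rough illustration of why arc primitives can be more economical than line segments for curved outlines, the sketch below counts how many equal chords are needed to keep a polyline within a given tolerance of one circular arc, using the standard sagitta formula. It is not the concatenated-arc algorithm itself, whose details are not given here. The function names `chords_needed` and `sagitta`, and the sample radius and tolerance, are assumptions chosen purely for illustration.

```python
import math


def sagitta(radius: float, angle: float) -> float:
    """Maximum gap between an arc subtending `angle` (radians) and its chord."""
    return radius * (1.0 - math.cos(angle / 2.0))


def chords_needed(radius: float, sweep: float, tol: float) -> int:
    """Fewest equal chords that stay within `tol` of an arc of angle `sweep`."""
    if radius <= 0 or sweep <= 0 or tol <= 0:
        raise ValueError("radius, sweep and tol must be positive")
    # Largest angle a single chord may subtend while its sagitta stays <= tol.
    # The clamp covers tolerances larger than the circle's diameter.
    ratio = max(-1.0, 1.0 - tol / radius)
    max_angle = 2.0 * math.acos(ratio)
    return max(1, math.ceil(sweep / max_angle))


if __name__ == "__main__":
    r, tol = 100.0, 0.5  # hypothetical radius and tolerance, in pixels
    n = chords_needed(r, 2.0 * math.pi, tol)
    print(f"full circle: {n} line segments versus 1 arc primitive")
```

With these sample values a full circle needs a few dozen chords, whereas a single arc primitive describes it exactly. How arcs are fitted and concatenated on real outlines is outside the scope of this sketch.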