Background Structure in Document Images

A method for analyzing the structure of the white background in document images is described, along with applications to the problem of isolating blocks of machine-printed text. The approach is based on computational-geometry algorithms for off-line enumeration of maximal white rectangles and on-line rectangle unification. These support a fast, simple, and general heuristic for geometric layout segmentation, in which white space is covered greedily by rectangles until all text blocks are isolated. Design of the heuristic can be substantially automated by an analysis of the empirical statistical distribution of properties of covering rectangles: for example, the stopping rule can be chosen by Rosenblatt’s perceptron training algorithm. Experimental trials show good behavior on the large and useful class of textual Manhattan layouts. On complex layouts from English-language technical journals of many publishers, the method finds good segmentations in a uniform and nearly parameter-free manner. On a variety of non-Latin texts, some with vertical text lines, the method finds good segmentations without prior knowledge of page and text-line orientation.