Document image segmentation and text area ordering

A system for document image segmentation and text area ordering is described and applied to complex printed page layouts in both Japanese and English. The method makes no assumption about the shape of blocks, so it can handle skewed images without skew correction as well as documents whose columns are not rectangular. Following a bottom-up strategy, connected components are extracted from a reduced image and classified according to their local information. The connected components are merged into lines, and the lines are merged into areas. Extracted text areas are classified as body, caption, header, or footer. A tree graph of the body text layout is built, and the reading order of the text is obtained by a preorder traversal of this tree. The authors introduce the influence range of each node, a procedure for the title part, and extraction of white horizontal separators, which make it possible to obtain good results on a variety of documents. The overall system is fast and compact.
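
Two of the steps named in the abstract, merging connected components into lines and reading the text order off a layout tree by preorder traversal, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the bounding-box format `(x0, y0, x1, y1)`, the greedy merge rule, the `max_gap` threshold, and the example tree are all assumptions made for illustration, and the influence range, title handling, and separator extraction described above are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), with y growing downward.
Box = Tuple[int, int, int, int]


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_into_lines(boxes: List[Box], max_gap: int) -> List[Box]:
    """Greedily merge component boxes into text-line boxes.

    Assumed rule (not taken from the paper): a component joins an existing
    line if the two overlap vertically and the horizontal gap between the
    line's right edge and the component's left edge is at most max_gap.
    """
    lines: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[0]):  # scan left to right
        for i, line in enumerate(lines):
            overlaps_vertically = box[1] < line[3] and line[1] < box[3]
            gap = box[0] - line[2]
            if overlaps_vertically and gap <= max_gap:
                lines[i] = union(line, box)
                break
        else:
            lines.append(box)  # no suitable line found: start a new one
    return lines


@dataclass
class Node:
    """A node of a layout tree; children are stored in reading order."""
    label: str
    children: List["Node"] = field(default_factory=list)


def reading_order(node: Node) -> List[str]:
    """Preorder traversal: visit the node first, then each child subtree."""
    order = [node.label]
    for child in node.children:
        order.extend(reading_order(child))
    return order


# Two components on one line, one on a lower line, one far to the right.
components = [(0, 0, 10, 10), (12, 1, 20, 9), (0, 20, 10, 30), (50, 0, 60, 10)]
lines = merge_into_lines(components, max_gap=5)

# A hand-built example tree: a title above a left and a right column.
page = Node("page", [Node("title", [Node("left column"), Node("right column")])])
order = reading_order(page)
```

Here the first two components merge into a single line because their gap of 2 pixels is within `max_gap`, while the component 30 pixels to the right stays separate, as it would across a column gutter. How the tree itself is constructed from the body text areas is the substantive part of the method and is not reproduced in this sketch.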
