Separation of text and background regions for high performance document image compression

We describe a document image segmentation algorithm that classifies a scanned document into regions such as text/line drawings, pictures, and smooth background. The proposed scheme is relatively independent of variations in text font style, size, intensity polarity, and string orientation. It is intended for use in an adaptive system for document image compression. The principal parts of the algorithm are the generation of the foreground and background layers and the application of hierarchical singular value decomposition (SVD) to smoothly fill the blank regions of both layers so that a high compression ratio can be achieved. The algorithm's effectiveness and computational efficiency were evaluated on several test images, on which it outperformed other techniques.
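The two principal parts named above, layer generation and SVD-based filling of the blank regions, can be illustrated with a minimal sketch. It is not the authors' method: it assumes dark text on a light background, swaps in a global Otsu threshold for the segmentation step, and replaces the hierarchical SVD with a single-level iterative low-rank fill. The function names (`otsu_threshold`, `svd_fill`, `split_layers`) and the `rank` and `iters` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def otsu_threshold(gray):
    """Global Otsu threshold for a uint8 image (assumed stand-in for segmentation)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # probability of the "dark" class
    mu = np.cumsum(prob * np.arange(256))      # cumulative first moment
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * omega - mu) ** 2 / denom   # between-class variance
    sigma_b[denom == 0] = 0.0                  # undefined where one class is empty
    return int(np.argmax(sigma_b))


def svd_fill(layer, known, rank=2, iters=25):
    """Fill pixels where `known` is False using an iterative low-rank SVD fit.

    Assumption: a single-level fill, not the hierarchical SVD of the paper.
    Requires at least one known pixel.
    """
    filled = layer.astype(float).copy()
    filled[~known] = filled[known].mean()      # start holes at the known mean
    for _ in range(iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled[~known] = approx[~known]        # update holes only, keep known pixels
    return filled


def split_layers(gray):
    """Return (mask, foreground, background) with blank regions smoothly filled."""
    mask = gray <= otsu_threshold(gray)        # True where text is assumed (dark pixels)
    foreground = svd_fill(gray, mask)          # text colors, holes where background was
    background = svd_fill(gray, ~mask)         # paper colors, holes where text was
    return mask, foreground, background
```

A caller would pass a 2-D `uint8` grayscale array to `split_layers` and compress the binary mask and the two filled layers separately. The smooth fill matters because the values under the holes are never displayed, so they can be chosen freely to make each layer cheap to encode.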
