Farsi and Arabic document images lossy compression based on the mixed raster content model

The mixed raster content (MRC) model was recently proposed for compound document image compression. Most state-of-the-art document image compression methods, such as DjVu, are based on this model, but they have several disadvantages, especially for Farsi and Arabic document images. First, they do not exploit characteristics of the Farsi/Arabic script that could further improve compression performance. Second, existing segmentation methods focus on cleanly separating textual objects from the background and/or optimizing the rate-distortion trade-off, but they do not consider text readability or ease of OCR. Third, these methods usually suffer from undesired jaggy artifacts and from misclassification of important textual details. In this paper, an MRC-based document image compression method is proposed that achieves a better rate-distortion trade-off than existing state-of-the-art document compression methods. The proposed method performs better in segmentation, bi-level mask layer compression, ease of OCR, and overall compression. It uses a 1D pattern matching technique to compress the mask layer, and a segmentation method that is sensitive enough to capture small textual objects. Experimental results show that the proposed method achieves considerably higher compression performance than the state-of-the-art compression method DjVu, as high as 1.75–2.3 times.
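To make the layered representation concrete, the following is a minimal sketch of the general three-layer MRC idea: a bi-level mask selects, pixel by pixel, between a foreground layer and a background layer. It is not the segmentation or coding method proposed in the paper. The function names `mrc_compose` and `threshold_mask`, the fixed threshold of 128, and the use of same-size grayscale layers are all assumptions made for illustration; practical MRC codecs typically store the foreground and background at reduced resolution and compress each layer with a different coder.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale values in 0..255 (illustrative)


def threshold_mask(image: Image, threshold: int = 128) -> Image:
    """Toy segmentation (an assumption, not the paper's method):
    mark pixels darker than the threshold as text (mask value 1)."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def mrc_compose(mask: Image, foreground: Image, background: Image) -> Image:
    """Rebuild an image from three same-size MRC layers:
    take the foreground where the mask is 1, the background elsewhere."""
    if not (len(mask) == len(foreground) == len(background)):
        raise ValueError("all three layers must have the same number of rows")
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# Hypothetical 2x3 page: light paper with a few dark text pixels.
page = [[250, 20, 245],
        [240, 30, 35]]
mask = threshold_mask(page)
flat_foreground = [[0] * 3 for _ in range(2)]    # smooth "ink" layer
flat_background = [[255] * 3 for _ in range(2)]  # smooth "paper" layer
restored = mrc_compose(mask, flat_foreground, flat_background)
```

In this sketch the sharp text edges live only in the bi-level mask, which is why the quality of segmentation and of mask layer compression matters so much for both readability and bit rate.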
