Symbolic Compression and Processing of Document Images

In this paper, we describe a compression and representation scheme that exploits the component-level redundancy found within a document image. The approach identifies patterns that appear repeatedly, represents similar patterns with a single prototype, stores the locations of the pattern instances, and codes the residuals between the prototypes and the pattern instances. Using a novel encoding scheme, we provide a representation that facilitates scalable lossy compression and progressive transmission, and that supports document image analysis in the compressed domain. We motivate the approach, provide details of the encoding procedures, report compression results, and describe a class of document image understanding tasks that operate on the compressed representation.
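The four steps named above (find repeated patterns, keep one prototype per group, record instance locations, code the residuals) can be illustrated with a small sketch. This is not the paper's encoder: the greedy first-match clustering, the Hamming-distance threshold `max_diff`, the XOR residual, and all function names are assumptions made for illustration, and the components are assumed to have been extracted already as small binary bitmaps with known positions.

```python
from typing import List, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]          # rows of 0/1 pixels
Component = Tuple[Tuple[int, int], Bitmap]    # ((x, y), bitmap)
Instance = Tuple[int, Tuple[int, int], Bitmap]  # (prototype index, (x, y), residual)


def same_shape(a: Bitmap, b: Bitmap) -> bool:
    """True when both bitmaps have identical height and width."""
    return len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Number of pixels that differ between two same-shaped bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def xor(a: Bitmap, b: Bitmap) -> Bitmap:
    """Pixel-wise XOR; used both to form and to apply a residual."""
    return tuple(tuple(pa ^ pb for pa, pb in zip(ra, rb)) for ra, rb in zip(a, b))


def encode(components: List[Component], max_diff: int = 1):
    """Greedy clustering (an assumption): the first pattern seen in each
    group becomes its prototype; later similar patterns reference it."""
    prototypes: List[Bitmap] = []
    instances: List[Instance] = []
    for pos, bmp in components:
        for idx, proto in enumerate(prototypes):
            if same_shape(proto, bmp) and hamming(proto, bmp) <= max_diff:
                break  # reuse an existing prototype
        else:
            prototypes.append(bmp)  # no match: start a new group
            idx = len(prototypes) - 1
        instances.append((idx, pos, xor(prototypes[idx], bmp)))
    return prototypes, instances


def decode(prototypes: List[Bitmap], instances: List[Instance],
           lossy: bool = False) -> List[Component]:
    """Lossy decoding places the bare prototype at each location;
    lossless decoding also applies the stored residual."""
    out: List[Component] = []
    for idx, pos, residual in instances:
        proto = prototypes[idx]
        out.append((pos, proto if lossy else xor(proto, residual)))
    return out
```

In this sketch, dropping the residuals gives a lossy reconstruction and adding them back restores the original pixels, which is one simple way to picture the scalable and progressive behaviour the abstract describes. The prototype indices and positions alone are what a compressed-domain analysis step would inspect.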
