Document image compression and analysis

Image compression usually treats minimizing storage space as its main objective. It is also desirable, however, to code images so that the resulting representation can be processed directly. In this thesis we explore an approach to document image compression that is efficient in both space (storage requirement) and time (processing flexibility). We present a representation in which component-level redundancy is removed by forming a prototype library and a component location table. This representation forms a basis for compression and provides direct access to image components. To generate the prototype library, we develop a new clustering approach suited to document image components. Its distance metric is based on a character degradation model, so that degraded versions of the same character are grouped together. To achieve a lossless representation when required, the residuals are encoded efficiently using a structural distance ordering. For lossy compression, OCR is used as a measure of readability to evaluate the rate-distortion tradeoff. We present a set of algorithms for typical document processing applications that operate effectively on the compressed representation. The applications demonstrated include subdocument retrieval, skew estimation, keyword search, and document image matching. Extensions of the paradigm to grayscale and graphic document images, networking, and multimedia objects are discussed.
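
To make the prototype library and component location table concrete, the following is a minimal sketch under stated assumptions, not the thesis's algorithm. All names (`encode`, `decode_component`, `max_dist`) are hypothetical. A plain Hamming distance with a greedy pass stands in for the degradation-model metric and the clustering approach, and residuals are kept as raw XOR bitmaps rather than being coded with a structural distance ordering.

```python
from typing import List, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]   # rows of 0/1 pixels
Component = Tuple[int, int, Bitmap]    # (x, y, bitmap) of one connected component

MISMATCH = 10**9  # distance reported for bitmaps of different size


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Number of differing pixels (assumed stand-in for the thesis's metric)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return MISMATCH
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def xor(a: Bitmap, b: Bitmap) -> Bitmap:
    """Pixel-wise XOR of two equally sized bitmaps."""
    return tuple(tuple(pa ^ pb for pa, pb in zip(ra, rb)) for ra, rb in zip(a, b))


def encode(components: List[Component], max_dist: int):
    """Greedy pass: reuse the closest prototype within max_dist, else add one."""
    library: List[Bitmap] = []                 # prototype library
    table: List[Tuple[int, int, int]] = []     # (x, y, prototype index)
    residuals: List[Bitmap] = []               # XOR against the chosen prototype
    for x, y, bitmap in components:
        best, best_dist = None, max_dist + 1
        for index, prototype in enumerate(library):
            dist = hamming(bitmap, prototype)
            if dist < best_dist:
                best, best_dist = index, dist
        if best is None:                       # nothing close enough: new prototype
            library.append(bitmap)
            best = len(library) - 1
        table.append((x, y, best))
        residuals.append(xor(bitmap, library[best]))
    return library, table, residuals


def decode_component(library: List[Bitmap], entry: Tuple[int, int, int],
                     residual: Bitmap) -> Bitmap:
    """Lossless reconstruction: prototype XOR residual gives the original bitmap."""
    _, _, index = entry
    return xor(library[index], residual)
```

In this sketch, dropping the residuals gives a lossy representation in which every component is replaced by its prototype. The location table alone already supports direct access: for example, all occurrences of one prototype can be found by scanning the table for its index, without decoding any pixels.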
