Automatic extraction of correlation-entropy features for text document analysis directly in run-length compressed domain

Automatic feature extraction plays a pivotal role in determining the overall performance of any Document Image Analysis system. Such systems conventionally operate on uncompressed images, although most real-time systems, such as fax machines, digital libraries and e-governance applications, store and archive documents in compressed form for the sake of storage and transfer efficiency. This implies that compressed documents must be decompressed before any operation or analysis can be carried out, which requires additional computing resources. This limitation of existing systems motivates the exploration of techniques that extract features directly from compressed documents, with the eventual aim of designing a document analysis system that works entirely in the compressed domain. This research work therefore proposes to extract novel correlation-entropy features directly from run-length compressed TIFF documents. It also investigates different methods that demonstrate some straightforward applications of the proposed features in compressed document image analysis, namely text and non-text component detection followed by compressed text line segmentation and characterization, all carried out on the compressed version of the printed text document without decompression. Finally, the reported experimental results validate the developed algorithms and illustrate that the proposed features are effective in distinguishing compressed text components from non-text components.
