Spotting of Keyword Directly in Run-Length Compressed Documents

With the rapid growth of digital libraries, e-governance and Internet applications, a huge volume of documents is being generated, communicated and archived in compressed form for better storage and transfer efficiency. In such large repositories of compressed documents, frequently used operations like keyword searching and document retrieval currently have to be carried out after decompression, and subsequently with the help of OCR. Developing a keyword spotting technique that works directly on compressed documents is therefore a promising and challenging research problem. Against this backdrop, the paper presents a novel approach for searching keywords directly in run-length compressed documents without going through the stages of decompression and OCR. The proposed method extracts simple, font-size-invariant features, such as the number of run transitions and the correlation of runs over selected regions of the test words, and matches them against those of the user-queried word. In the subsequent step, keywords are spotted in the compressed document based on the matching score. The idea of decompression-less and OCR-less word spotting directly in compressed documents is the major contribution of this paper. The method is tested on a data set of compressed documents, and the preliminary results validate the proposed idea.
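
To make the run-transition feature concrete, the following is a minimal sketch and not the authors' implementation. It assumes each row of a word image is stored as alternating run lengths that start with a white run, and it approximates font-size invariance by averaging per-row transition counts over a fixed number of horizontal bands. The helper names (`rle_encode_row`, `run_transitions`, `transition_profile`, `match_score`), the band count, and the absolute-difference score are all hypothetical choices made for illustration; the correlation-of-runs feature is not covered here.

```python
from typing import List

Row = List[int]  # alternating run lengths, starting with a white (0) run


def rle_encode_row(pixels: List[int]) -> Row:
    """Encode a binary row (0 = white, 1 = black) as alternating run lengths.

    The first run is always white, so a row that starts with black
    gets a leading white run of length 0.
    """
    runs: Row = []
    current, length = 0, 0
    for p in pixels:
        if p == current:
            length += 1
        else:
            runs.append(length)
            current, length = p, 1
    runs.append(length)
    return runs


def run_transitions(runs: Row) -> int:
    """Count colour changes in a row using only its run lengths."""
    nonzero = sum(1 for r in runs if r > 0)
    return max(0, nonzero - 1)


def transition_profile(rows: List[Row], bands: int = 4) -> List[float]:
    """Average the per-row transition counts over a fixed number of bands.

    Assumption: using a fixed band count makes the profile comparable
    across word images of different heights.
    """
    counts = [run_transitions(r) for r in rows]
    n = len(counts)
    if n == 0:
        raise ValueError("word image has no rows")
    profile = []
    for b in range(bands):
        lo = b * n // bands
        hi = max(lo + 1, (b + 1) * n // bands)
        chunk = counts[lo:hi]
        profile.append(sum(chunk) / len(chunk))
    return profile


def match_score(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences; 0.0 means identical profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def upscale(img: List[List[int]], k: int) -> List[List[int]]:
    """Repeat every pixel k times in both directions (a larger 'font')."""
    return [[p for p in row for _ in range(k)] for row in img for _ in range(k)]


# Toy word images (hypothetical data, 0 = white, 1 = black).
query = [
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
other = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
]

query_rle = [rle_encode_row(r) for r in query]
large_rle = [rle_encode_row(r) for r in upscale(query, 2)]
other_rle = [rle_encode_row(r) for r in other]

same_word = match_score(transition_profile(query_rle), transition_profile(large_rle))
diff_word = match_score(transition_profile(query_rle), transition_profile(other_rle))
```

In this toy setup the query and its enlarged copy produce the same profile, so `same_word` is lower than `diff_word`. A threshold on such a score would be one possible way to decide whether a candidate word is spotted; all features here are computed from run lengths alone, without reconstructing the pixel image inside the matching step.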
