Document Image Indexing Using Edit Distance Based Hashing

We present a novel word image based document indexing scheme by combination of string matching and hashing. The word image representation is defined by string codes obtained by unsupervised learning over graphical primitives. The indexing framework is defined by distance based hashing function which does the object projection to hash space by preserving their distances. We have used edit distance based string matching for defining the hashing function and for approximate nearest neighbor based retrieval. The application of the proposed indexing framework is presented for two document image collections belonging to Devanagari and Bengali script.

[1]  Panagiotis Papapetrou,et al.  Nearest Neighbor Retrieval Using Distance-Based Hashing , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Takehiro Nakayama Content-Oriented Categorization of Document Images , 1996, COLING.

[3]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[4]  Karl Aberer,et al.  Distributed similarity search in high dimensions using locality sensitive hashing , 2009, EDBT '09.

[5]  Shijian Lu,et al.  Document Image Retrieval through Word Shape Coding , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[7]  Santanu Chaudhury,et al.  Use of MKL as symbol classifier for Gujarati character recognition , 2010, DAS '10.

[8]  Martial Hebert,et al.  Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Dan S. Bloomberg,et al.  Word spotting in scanned images using hidden Markov models , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Giovanni Soda,et al.  Font adaptive word indexing of modern printed documents , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Chew Lim Tan,et al.  Keyword Spotting in Document Images through Word Shape Coding , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[12]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[13]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[14]  T. Syeda-Mahmood Indexing of handwritten document images , 1997, Proceedings Workshop on Document Image Analysis (DIA'97).

[15]  Santanu Chaudhury,et al.  Signature verification using multiple neural classifiers , 1997, Pattern Recognit..

[16]  Wang Weihong,et al.  A Scalable Content-based Image Retrieval Scheme Using Locality-sensitive Hashing , 2009, 2009 International Conference on Computational Intelligence and Natural Computing.

[17]  R. Manmatha,et al.  Word image matching using dynamic time warping , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[18]  Haiying Shen,et al.  An Efficient Similarity Searching Scheme in Massive Databases , 2008, 2008 The Third International Conference on Digital Telecommunications (icdt 2008).

[19]  Piotr Indyk,et al.  Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality , 2012, Theory Comput..

[20]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..