Word shape descriptor-based document image indexing: a new DBH-based approach

In this paper, we propose a novel feature representation for binary patterns that exploits object shape information. The representation is first evaluated on character classification for the Bengali and Gujarati scripts and is then extended to word images. Combined with distance-based hashing (DBH), the proposed representation is used to define a novel word-image-based framework for document image indexing and retrieval. Hierarchical hashing is used to reduce retrieval time complexity. In addition, to reduce the size of the hashing data structure, the concept of multi-probe hashing is extended to binary mapping functions. An exhaustive experimental evaluation of the framework on a collection of documents in the Devanagari, Bengali and English scripts has yielded encouraging results.
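To make the indexing idea concrete, the following is a minimal Python sketch of distance-based hashing with multi-probe lookup over binary hash keys. It is an illustration under stated assumptions, not the paper's implementation: the shape descriptor is replaced by plain binary tuples compared with Hamming distance, each binary mapping function thresholds a pivot-pair line projection at the sample median (a simplification of the interval thresholds used in the DBH literature), and all function names (`line_projection`, `build_hash`, `probe_keys`, `query`) are hypothetical. Hierarchical hashing is not covered.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Distance between two equal-length binary feature vectors (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))

def line_projection(x, p1, p2, dist):
    """Project x onto the 'line' through pivots p1 and p2 using distances only."""
    d12 = dist(p1, p2)
    return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2 * d12)

def make_bit_function(p1, p2, sample, dist):
    """Binary mapping function: 1 if the projection reaches the sample median."""
    values = sorted(line_projection(s, p1, p2, dist) for s in sample)
    threshold = values[len(values) // 2]
    return lambda x: 1 if line_projection(x, p1, p2, dist) >= threshold else 0

def build_hash(sample, k, dist, rng):
    """Concatenate k binary mapping functions, each with a random pivot pair."""
    funcs = []
    while len(funcs) < k:
        p1, p2 = rng.sample(sample, 2)
        if dist(p1, p2) > 0:  # identical pivots would give a zero denominator
            funcs.append(make_bit_function(p1, p2, sample, dist))
    return lambda x: tuple(f(x) for f in funcs)

def probe_keys(key, radius):
    """Yield the key itself, then every key within `radius` bit flips of it."""
    yield key
    for r in range(1, radius + 1):
        for positions in combinations(range(len(key)), r):
            flipped = list(key)
            for i in positions:
                flipped[i] ^= 1
            yield tuple(flipped)

def build_index(items, h):
    """Single hash table mapping each binary key to the ids stored under it."""
    table = {}
    for idx, item in enumerate(items):
        table.setdefault(h(item), []).append(idx)
    return table

def query(q, items, table, h, dist, radius=1):
    """Return the id of the closest candidate found in the probed buckets."""
    candidates = {i for key in probe_keys(h(q), radius) for i in table.get(key, [])}
    return min(candidates, key=lambda i: dist(q, items[i]), default=None)

if __name__ == "__main__":
    rng = random.Random(0)
    items = [tuple(rng.randint(0, 1) for _ in range(16)) for _ in range(50)]
    h = build_hash(items, k=6, dist=hamming, rng=rng)
    table = build_index(items, h)
    print(query(items[7], items, table, h, hamming, radius=1))
```

Probing neighbouring keys in one table stands in for maintaining many independent tables, which is the usual motivation for multi-probe schemes: fewer tables are stored at the cost of more bucket lookups per query.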
