Locality Sensitive Pseudo-Code for Document Images

In this paper, we propose a novel scheme for representing character string images in scanned documents. We convert conventional multi-dimensional descriptors into pseudo-codes with the following property: if two vectors are near each other in the original space, then their encoded pseudo-codes are "semi-equivalent" with high probability. For this conversion, we combine locality sensitive hashing (LSH) indices, and we also develop a new family of LSH functions that is superior to earlier ones when all vectors are constrained to lie on the surface of the unit sphere. Word spotting based on our pseudo-codes is faster than the multi-dimensional descriptor-based method, with scarcely any loss of accuracy.
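
The idea of turning a descriptor into a pseudo-code by combining several LSH indices can be illustrated with a minimal sketch. It is not the scheme proposed in the paper: it swaps in plain random-hyperplane (sign) hashing instead of the new LSH family for unit-sphere vectors, and it assumes that "semi-equivalent" means that at least `min_matches` of the indices agree. All function names and parameters (`make_hyperplanes`, `pseudo_code`, `semi_equivalent`, `n_bits`, `n_codes`, `min_matches`) are hypothetical.

```python
import math
import random


def make_hyperplanes(dim, n_bits, n_codes, seed=0):
    """Draw n_codes independent groups of n_bits random Gaussian hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        for _ in range(n_codes)
    ]


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pseudo_code(v, planes):
    """Map a vector to a tuple of LSH indices, one per hyperplane group.

    Each index packs the signs of the dot products with one group of
    hyperplanes (random-hyperplane hashing, used here as a stand-in).
    """
    code = []
    for group in planes:
        index = 0
        for h in group:
            bit = 1 if sum(a * b for a, b in zip(h, v)) >= 0 else 0
            index = (index << 1) | bit
        code.append(index)
    return tuple(code)


def semi_equivalent(c1, c2, min_matches=1):
    """Assumed definition: two codes match if enough positions agree."""
    return sum(a == b for a, b in zip(c1, c2)) >= min_matches


planes = make_hyperplanes(dim=8, n_bits=4, n_codes=6, seed=42)
v = normalize([0.9, 0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6])
code_v = pseudo_code(v, planes)
code_opposite = pseudo_code([-x for x in v], planes)
```

Under these assumptions, comparing two descriptors reduces to comparing short tuples of integers rather than computing a distance in the original space, which is where a speed-up of the kind described above would come from. Nearby vectors tend to fall on the same side of most hyperplanes and so share indices, whereas `v` and its antipode share none.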
