LSH-based large scale chinese calligraphic character recognition

Chinese calligraphy is the art of handwriting and is an important part of Chinese traditional culture. But due to the complexity of shape and styles of calligraphic characters, it is difficult for com-mon people to recognize them. So it would be great if a tool is provided to help users to recognize the unknown calligraphic characters. But the well-known OCR (Optical Character Recogni-tion) technology can hardly help people to recognize the unknown characters because of their deformation and complexity. Numerous collections of historical Chinese calligraphic works are digitized and stored in CADAL (China Academic Digital Associate Library) calligraphic system [1], and a huge database CCD (Calligraphic Character Dictionary) is built, which contains character images labeled with semantic meaning. In this paper, a LSH-based large scale Chinese calligraphic character recognition method is proposed basing on CCD. In our method, GIST descriptor is used to represent the global features of the calligraphic character images, LSH (Locality-sensitive hashing) is used to search CCD to find the similar character images to the recognized calligraphic character image. The recognition is based on the semantic probability which is computed according to the ranks of retrieved images and their distances to the recognized image in the Gist feature space. Our experiments show that our method is effective and efficient for recognizing Chinese calligraphic character image.

[1]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[2]  Ilan Shimshoni,et al.  Mean shift based clustering in high dimensions: a texture classification example , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Edith Cohen,et al.  Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[4]  Cheng Yang MACS: music audio characteristic sequence indexing for similarity retrieval , 2001, Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575).

[5]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[6]  Jeremy Buhler,et al.  Efficient large-scale sequence comparison by locality-sensitive hashing , 2001, Bioinform..

[7]  Daming Shi,et al.  Offline handwritten Chinese character recognition by radical decomposition , 2003, TALIP.

[8]  Klara Kedem,et al.  Classification of Hebrew calligraphic handwriting styles: preliminary results , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[9]  Yueting Zhuang,et al.  Skeleton-Based Recognition of Chinese Calligraphic Character Image , 2008, PCM.

[10]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[11]  Piotr Indyk,et al.  Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality , 2012, Theory Comput..

[12]  Jeremy Buhler,et al.  Provably sensitive Indexing strategies for biosequence similarity search , 2002, RECOMB '02.

[13]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.

[14]  Hector Garcia-Molina,et al.  Detecting Digital Copyright Violations On The Internet , 1999 .

[15]  Alexei A. Efros,et al.  Scene completion using millions of photographs , 2007, SIGGRAPH 2007.

[16]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[17]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[18]  R. Manmatha,et al.  Indexing for a Digital Library of George Washington’s Manuscripts: A Study of Word Matching Techniques , 2002 .

[19]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[20]  Nasir D. Memon,et al.  Cluster-based delta compression of a collection of files , 2002, Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002..

[21]  Ehud Rivlin,et al.  Applying algebraic and differential invariants for logo recognition , 1996, Machine Vision and Applications.

[22]  Cordelia Schmid,et al.  Evaluation of GIST descriptors for web-scale image search , 2009, CIVR '09.