Unsupervised Word Clustering Using Deep Features

Digitization is crucial especially in the Indian context. OCR engines fail on Indian scripts mainly because character segmentation is non-trivial. Even word based recognition approaches suffer from the issues such as time degradations, word segmentation errors, font style/size variations. In this paper, we propose a deep learning architecture based approach for unsupervised word clustering. An edge responsive untrained Convolutional Neural Network (CNN) is used as a feature extractor. Graph connected component analysis is applied on the similarity graph computed from the word features. Our approach inherently detects similar shape patterns at word level and hence, it is language agnostic. We validated our approach against multiple state of art word matching techniques. Experimental results show that our approach significantly outperforms all of them on variety of data sets. In addition, the approach is observed to be robust to word segmentation errors, font style/size variations.

[1]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[2]  David D. Cox,et al.  A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation , 2009, PLoS Comput. Biol..

[3]  Martial Hebert,et al.  Object Recognition by a Cascade of Edge Probes , 2002, BMVC.

[4]  Zhenghao Chen,et al.  On Random Weights and Unsupervised Feature Learning , 2011, ICML.

[5]  C. V. Jawahar,et al.  Nearest neighbor based collection OCR , 2010, DAS '10.

[6]  Patrice Y. Simard,et al.  Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[7]  Andrew Y. Ng,et al.  Reading Digits in Natural Images with Unsupervised Feature Learning , 2011 .

[8]  C. V. Jawahar,et al.  Searching in Document Images , 2004, ICVGIP.

[9]  A. G. Ramakrishnan,et al.  Word level multi-script identification , 2008, Pattern Recognit. Lett..

[10]  Xudong Jiang,et al.  LBP-Based Edge-Texture Features for Object Recognition , 2014, IEEE Transactions on Image Processing.

[11]  R. Manmatha,et al.  Word image matching using dynamic time warping , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[12]  Georg Heigold,et al.  Small-footprint keyword spotting using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[14]  Cordelia Schmid,et al.  Shape recognition with edge-based features , 2003, BMVC.

[15]  R. Manmatha,et al.  Holistic word recognition for handwritten historical documents , 2004, First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings..

[16]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  Dumitru Erhan,et al.  Deep Neural Networks for Object Detection , 2013, NIPS.

[18]  Fumitaka Kimura,et al.  Shape Code Based Word-Image Matching for Retrieval of Indian Multi-lingual Documents , 2010, 2010 20th International Conference on Pattern Recognition.