Matching Handwritten Document Images

We address the problem of predicting the similarity between a pair of handwritten document images that may have been written by different individuals. This has applications in matching and mining image collections that contain handwritten content. A similarity score is computed by detecting patterns of text reuse between the two document images, irrespective of minor variations in word morphology, word ordering, layout, and paraphrasing of the content. Our method does not depend on an accurate segmentation of words and lines. We formulate the document matching problem as a structured comparison of the word distributions across two document images. To match two word images, we propose a feature descriptor based on a convolutional neural network (CNN). This representation outperforms the state of the art on handwritten word spotting. Finally, we demonstrate the applicability of our method to the practical problem of matching handwritten assignments.
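As a rough illustration of comparing word distributions across two documents, the sketch below treats each document as an unordered collection of word-image descriptors and scores how many words in one document have a close counterpart in the other. It is not the method of this paper: the cosine measure, the `threshold` value, the symmetric averaging, and the names `cosine` and `document_similarity` are all assumptions made for the example, and the descriptors are stand-ins for whatever a CNN would produce.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two descriptors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(
    doc_a: Sequence[Vector],
    doc_b: Sequence[Vector],
    threshold: float = 0.8,  # hypothetical cut-off for "same word"
) -> float:
    """Symmetric fraction of word descriptors that find a match in the other document.

    Word order and position are ignored, so reordering the words of either
    document does not change the score.
    """
    if not doc_a or not doc_b:
        return 0.0

    def coverage(src: Sequence[Vector], dst: Sequence[Vector]) -> float:
        # Count words in src whose best match in dst clears the threshold.
        hits = sum(
            1 for w in src if max(cosine(w, x) for x in dst) >= threshold
        )
        return hits / len(src)

    return 0.5 * (coverage(doc_a, doc_b) + coverage(doc_b, doc_a))


if __name__ == "__main__":
    # Toy 3-dimensional descriptors standing in for CNN features.
    doc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    doc_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # same words, different order
    doc_c = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # one word in common
    print(document_similarity(doc_a, doc_b))
    print(document_similarity(doc_a, doc_c))
```

Because the score depends only on which descriptors find a counterpart, it is insensitive to word ordering and layout. Tolerance to morphology and paraphrasing would have to come from the descriptor and from a richer comparison than this per-word nearest-neighbour test.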
