Language Independent Keyword Based Information Retrieval System of Handwritten Documents using SVM Classifier and Converting Words into Shapes

This work presents a language-independent, keyword-based document indexing and retrieval system that uses an SVM as the classifier. Word spotting is an attractive alternative to traditional Optical Character Recognition (OCR) systems: instead of converting the image into text, retrieval is based on matching images of words using pattern classification techniques. The proposed technique extracts words from images of handwritten documents and converts each word image into a shape represented by its contour. A set of features is then extracted from each word image, and instances of the same word are grouped into clusters. These clusters are used to train a multi-class SVM, which learns the different word classes. The documents to be indexed are segmented into words, and the closest cluster for each word is determined using the SVM. An index file is maintained for each word, containing the locations of that word within each document. A query word presented to the system is matched against the clusters in the database, and the documents containing occurrences of the query word are retrieved. The system achieved promising precision and recall rates on the IAM database of handwritten documents.
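The indexing and retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_index`, `retrieve`, and `nearest_centroid` are hypothetical, word images are assumed to have already been reduced to numeric feature vectors, and a nearest-centroid rule stands in for the trained multi-class SVM so that the example runs with the standard library alone.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed representation: one numeric feature vector per word image.
Features = Sequence[float]
# (document id, position of the word within that document)
Location = Tuple[str, int]
Classifier = Callable[[Features], int]


def nearest_centroid(centroids: List[Features]) -> Classifier:
    """Stand-in for the trained multi-class SVM (an assumption of this sketch).

    Returns a function that maps a feature vector to the index of the
    closest cluster centre by squared Euclidean distance.
    """
    def classify(features: Features) -> int:
        def sq_dist(centre: Features) -> float:
            return sum((a - b) ** 2 for a, b in zip(features, centre))
        return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))
    return classify


def build_index(
    documents: Dict[str, List[Features]],
    classify: Classifier,
) -> Dict[int, List[Location]]:
    """Record, for each word class, where its instances occur."""
    index: Dict[int, List[Location]] = defaultdict(list)
    for doc_id, words in documents.items():
        for position, features in enumerate(words):
            index[classify(features)].append((doc_id, position))
    return dict(index)


def retrieve(
    index: Dict[int, List[Location]],
    classify: Classifier,
    query: Features,
) -> List[str]:
    """Return the sorted ids of documents containing the query's word class."""
    label = classify(query)
    return sorted({doc_id for doc_id, _ in index.get(label, [])})
```

In a fuller system, `classify` would be the prediction function of the SVM trained on the word clusters, and the index would be persisted to a file per word class rather than held in memory.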
