Information Retrieval from Historical Document Image Base

This communication presents a method for information retrieval from a historical document image base. The proposed approach extracts words and characters from the text and assigns a feature vector to each character image. Words are matched by comparing their characters through a multistage Dynamic Time Warping (DTW) procedure applied to the extracted feature set. The approach shows promising results, reaching a retrieval/recognition rate of more than 96%.
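As an illustration of the matching step, the following is a minimal sketch of classic single-stage DTW between two sequences of feature vectors. It is not the authors' multistage procedure, whose stages and features are not detailed here. The names `euclidean`, `dtw_distance`, and `rank_by_dtw`, and the one-dimensional toy feature vectors, are assumptions made only for this example.

```python
from math import inf, sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic DTW cost between two sequences of feature vectors.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i elements of seq_a with the first j elements of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: advance in seq_a, advance in seq_b, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rank_by_dtw(query, candidates):
    """Return candidate indices ordered from best to worst DTW match."""
    return sorted(range(len(candidates)),
                  key=lambda k: dtw_distance(query, candidates[k]))


# Hypothetical toy data: each "character" is a 1-D feature vector.
query = [[0.0], [1.0], [2.0]]
candidates = [
    [[5.0], [5.0], [5.0]],          # unrelated word
    [[0.0], [1.0], [1.0], [2.0]],   # same shape, one repeated element
]
print(rank_by_dtw(query, candidates))  # prints [1, 0]
```

DTW tolerates the repeated element in the second candidate because one query element may align with several candidate elements, which is why it suits words whose segmentation yields differing numbers of pieces.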
