A keyword retrieval system for historical Mongolian document images

In this paper, we propose a keyword retrieval system for locating words in historical Mongolian document images. Following the word-spotting approach, a collection of historical Mongolian document images is converted into a collection of word images by word segmentation, and several profile-based features are extracted to represent each word image. For each word image, a fixed-length feature vector is formed by taking a suitable number of complex coefficients of the discrete Fourier transform of each profile feature. The system supports online image-to-image matching: it computes the similarity between a query word image and every word image in the collection and returns the results ranked in descending order of similarity. The query word image can be synthesized from a sequence of glyphs at retrieval time. Experimental evaluation confirms the performance of the system.
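The feature extraction and ranking steps described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the three profiles (row-wise projection, left contour, right contour), the number of retained coefficients `k`, the choice of the first `k` coefficients, the use of cosine similarity, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def profile_features(img):
    """Return three 1-D profiles taken along the rows of a binary word image.

    img: 2-D array in which nonzero pixels are ink. The choice of these three
    profiles is an assumption; the abstract only says "profile-based features".
    """
    ink = np.asarray(img) > 0
    _, w = ink.shape
    has_ink = ink.any(axis=1)
    projection = ink.sum(axis=1).astype(float)            # ink pixels per row
    # Column of the first ink pixel in each row (w for empty rows).
    left = np.where(has_ink, ink.argmax(axis=1), w).astype(float)
    # Column of the last ink pixel in each row (-1 for empty rows).
    right = np.where(has_ink, w - 1 - ink[:, ::-1].argmax(axis=1), -1).astype(float)
    return [projection, left, right]

def dft_vector(img, k=8):
    """Build a fixed-length vector from the first k DFT coefficients per profile.

    Word images of different heights yield vectors of the same length (6 * k):
    real and imaginary parts of k coefficients for each of three profiles.
    """
    parts = []
    for p in profile_features(img):
        # Zero-pad profiles shorter than k so k coefficients always exist.
        coeffs = np.fft.fft(p, n=max(len(p), k))[:k] / len(p)
        parts.extend([coeffs.real, coeffs.imag])
    return np.concatenate(parts)

def rank(query_img, collection, k=8):
    """Return (index, similarity) pairs sorted by descending cosine similarity."""
    q = dft_vector(query_img, k)
    scores = []
    for idx, img in enumerate(collection):
        v = dft_vector(img, k)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        sim = float(q @ v / denom) if denom > 0 else 0.0
        scores.append((idx, sim))
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Two toy "word images" of different heights.
a = np.zeros((20, 10)); a[2:18, 3:6] = 1
b = np.zeros((35, 12)); b[5:10, 1:11] = 1; b[20:30, 4:6] = 1
ranking = rank(a, [b, a])  # the copy of the query should rank first
```

Truncating the spectrum is what makes the vectors comparable across word images of different sizes; in a real system the vectors for the collection would be computed once offline, and only the query vector would be computed at retrieval time.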
