Word shape recognition for image-based document retrieval

We propose a word shape recognition method for retrieving image-based documents. Document images are first segmented at the word level. The method then detects local extrema points in each word segment to form so-called vertical bar patterns, which together make up the feature vector of a document. The pairwise similarity of two document images is measured by the scalar product of their feature vectors. The method is robust to changes in font and style, and is less affected by degradation in document quality. Three groups of words in different fonts and image qualities were used to test the validity of the method, and real-life document images were used to test its ability to retrieve relevant documents.
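The similarity step can be illustrated with a minimal sketch. It assumes each word image has already been reduced to a discrete shape code (the strings below are hypothetical placeholders, not the paper's actual vertical bar patterns), that a document's feature vector is the count of each code, and that vectors are normalized to unit length before the scalar product is taken. The abstract specifies only the scalar product; the counting and normalization are assumptions made for illustration.

```python
import math
from collections import Counter


def feature_vector(word_codes, vocabulary):
    """Count how often each shape code in `vocabulary` occurs in a document.

    `word_codes` is a list of hypothetical per-word shape codes; the encoding
    of real vertical bar patterns is not specified here.
    """
    counts = Counter(word_codes)
    return [counts[code] for code in vocabulary]


def normalize(vector):
    """Scale a vector to unit length (assumption: not stated in the abstract)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def similarity(u, v):
    """Scalar product of two unit-normalized document feature vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Three toy documents described by made-up shape codes.
doc_a = ["p1", "p2", "p1"]
doc_b = ["p1", "p2", "p1"]
doc_c = ["p3"]

vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))
vec_a = feature_vector(doc_a, vocabulary)
vec_b = feature_vector(doc_b, vocabulary)
vec_c = feature_vector(doc_c, vocabulary)

print(similarity(vec_a, vec_b))  # identical code counts give the maximum score
print(similarity(vec_a, vec_c))  # no shared codes give a score of zero
```

With normalization, the score depends on the relative distribution of shape codes rather than on document length, so documents of different sizes can be ranked against the same query.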
