Vertical bar detection for gauging text similarity of document images

This paper proposes a new method for gauging the text similarity of image-based documents using word shape recognition. Image features are extracted directly, without OCR (optical character recognition). The method segments each document image into word units, detects local extrema points within them, and forms so-called vertical bar patterns from those points. These vertical bar patterns form the feature vector of a document. The pairwise similarity of two document images is measured by the scalar product of their feature vectors. The proposed method is robust to changes in font and style, and is less affected by degradation of document quality. To test the validity of the method, four corpora of document images were used, and the ability of the method to retrieve relevant documents is reported.
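
As a rough illustration of the similarity step only, the scalar product of two document feature vectors might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated here: the encoding of each vertical bar pattern as a string token, the use of occurrence counts as vector components, the optional length normalization, and the names `feature_vector`, `scalar_product`, and `similarity` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def feature_vector(patterns):
    """Build a sparse feature vector from a document's pattern tokens.

    Assumption: each vertical bar pattern has already been extracted and
    encoded as a hashable token (here a string); components are raw counts.
    """
    return Counter(patterns)


def scalar_product(u, v):
    """Scalar (dot) product of two sparse vectors stored as Counters."""
    # Iterate over the smaller vector; a Counter returns 0 for missing keys.
    if len(u) > len(v):
        u, v = v, u
    return sum(count * v[token] for token, count in u.items())


def similarity(u, v, normalize=True):
    """Pairwise similarity of two documents.

    Assumption: with normalize=True the scalar product is divided by the
    vector lengths, so documents of different sizes are comparable.
    """
    dot = scalar_product(u, v)
    if not normalize:
        return dot
    norm = sqrt(scalar_product(u, u) * scalar_product(v, v))
    return dot / norm if norm else 0.0


# Hypothetical pattern tokens for two small documents.
doc_a = feature_vector(["101", "110", "101"])
doc_b = feature_vector(["101", "011"])
score = similarity(doc_a, doc_b)
```

With normalization, a document compared against itself scores 1.0 and two documents that share no patterns score 0.0, which makes it straightforward to rank retrieved documents by score.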