Flexible Sequence Matching Technique: Application to Word Spotting in Degraded Documents

In this paper, a new sequence-matching algorithm, called the Flexible Sequence Matching (FSM) algorithm, is proposed. FSM combines several capabilities of other sequence-matching algorithms (notably DTW, CDP and MVM), which can be configured depending on the application domain. Its generality and robustness come from its ability to find subsequences (as in CDP), to skip outliers inside the matched sequences (as in MVM) and to match multiple elements with a single one (as in CDP and DTW). These properties make it highly suitable for robust word spotting. More precisely, the FSM algorithm can retrieve a query inside a line or a piece of a line. This is useful when the word segmentation process may not work accurately or when only line segmentation information is available. Furthermore, its skipping capability makes the proposed FSM algorithm less sensitive to local variations in the spelling of words and to local degradation effects. Finally, its multiple-matching facilities (many-to-one and one-to-many matching) are useful when the target and query sequences differ in length because of variability in scale factor. We demonstrate the superiority of the proposed FSM algorithm in specific cases such as incorrect word segmentation and word-level local variations. Experiments on the handwritten George Washington dataset and on historical typewritten document images gave quite promising results.
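
The three properties named above (subsequence search, outlier skipping, and multiple matching) can be illustrated with a small dynamic program. The following is only a minimal sketch under stated assumptions, not the FSM algorithm as defined in the paper: the function name `flexible_match`, the `skip_penalty` parameter, the use of 1-D numeric sequences, and the absolute-difference local cost are all hypothetical choices made for illustration.

```python
def flexible_match(query, target, skip_penalty=0.5):
    """Find the best match of `query` inside a longer `target`.

    Illustrative sketch only (not the published FSM recurrence).
    Returns (cost, start, end), where start/end are target indices.
    """
    n, m = len(query), len(target)
    if n == 0 or m == 0:
        raise ValueError("query and target must be non-empty")

    inf = float("inf")
    # cost[i][j]: best cost of matching query[0..i] with query[i] on target[j]
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: target index where that best path begins
    start = [[0] * m for _ in range(n)]

    # Subsequence property: a match may begin at any target position for free.
    for j in range(m):
        cost[0][j] = abs(query[0] - target[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            # Many-to-one: target[j] also absorbs the previous query element.
            candidates = [(cost[i - 1][j], start[i - 1][j])]
            if j >= 1:
                # One-to-many: query[i] also absorbs the previous target element.
                candidates.append((cost[i][j - 1], start[i][j - 1]))
                # Plain one-to-one step.
                candidates.append((cost[i - 1][j - 1], start[i - 1][j - 1]))
            if j >= 2:
                # Skip target[j-1] as an outlier, paying an assumed fixed penalty.
                candidates.append(
                    (cost[i - 1][j - 2] + skip_penalty, start[i - 1][j - 2])
                )
            best_cost, best_start = min(candidates)
            cost[i][j] = abs(query[i] - target[j]) + best_cost
            start[i][j] = best_start

    # Subsequence property: the match may end at any target position.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end


if __name__ == "__main__":
    # The 7 inside the target is skipped as an outlier.
    print(flexible_match([1, 2, 3], [9, 9, 1, 2, 7, 3, 9]))  # (0.5, 2, 5)
```

In this toy example the query is located at target positions 2 to 5, and the only cost incurred is the assumed penalty for skipping the outlier. In a word-spotting setting the scalar elements would be replaced by column-wise feature vectors of the word or line image, with a suitable vector distance as the local cost.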
