Word mining in a sparsely labeled handwritten collection

Word-spotting techniques are usually based on detailed modeling of target words, followed by a search for the locations of such a target word in images of handwriting. In this study, the focus is on deciding whether target words are present in lines of text, regardless of their horizontal position. Line strips are modeled with a Bag-of-Glyphs approach based on a self-organizing map. This approach uses the presence of fragmented connected-component shapes (glyphs) in a line strip to characterize the text passage, similar to the Bag-of-Words approach for 'ASCII'-encoded documents in regular Information Retrieval. Subsequently, a support-vector machine is trained to detect the presence of a word or word category, in an iterative setup that involves an active group of users. Results are promising for a large proportion of words and depend both on the number of labeled lines and on shape uniqueness. Particularly useful is the ability to train on abstract content classes such as proper names, municipalities or word-bigram presence in the line-strip images.
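
As an illustration of the Bag-of-Glyphs idea, the following minimal sketch turns the glyphs of one line strip into a normalized histogram over codebook nodes. It is not the implementation used in this study: the function name bag_of_glyphs, the 2-D feature vectors and the three-node codebook are hypothetical, and the codebook stands in for the nodes of an already trained self-organizing map. A histogram of this kind could then serve as the input vector for a classifier such as a support-vector machine.

```python
from collections import Counter
from math import dist


def bag_of_glyphs(glyph_features, codebook):
    """Return a normalized histogram over codebook nodes for one line strip.

    glyph_features: feature vectors, one per glyph found in the line strip.
    codebook: node vectors, assumed to come from a trained self-organizing map.
    """
    counts = Counter()
    for feat in glyph_features:
        # Index of the nearest codebook node (the best-matching unit).
        bmu = min(range(len(codebook)), key=lambda i: dist(feat, codebook[i]))
        counts[bmu] += 1
    total = sum(counts.values())
    if total == 0:
        # A line strip without glyphs yields an all-zero histogram.
        return [0.0] * len(codebook)
    return [counts[i] / total for i in range(len(codebook))]


# Hypothetical toy data: 2-D glyph descriptors and a 3-node codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
line_a = [(0.1, 0.0), (0.9, 0.1), (1.0, -0.1), (0.0, 0.9)]
# Similar glyph shapes as line_a, but occurring in a different order.
line_b = [(0.9, 0.0), (0.1, 0.1), (0.0, 1.1), (1.1, 0.0)]

hist_a = bag_of_glyphs(line_a, codebook)
hist_b = bag_of_glyphs(line_b, codebook)
print(hist_a)
print(hist_b)
```

Both toy lines produce the same histogram, because only the presence of glyph shapes is counted and their order within the line strip is discarded.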