Concept Disambiguation for Improved Subject Access Using Multiple Knowledge Sources

We address the problem of mining text for relevant image metadata. Our work is situated in the art and architecture domain, where highly specialized technical vocabulary presents challenges for NLP techniques. To extract high-quality metadata, word sense ambiguity must be resolved so that ambiguous, and thus faulty, metadata does not lead the searcher to the wrong image. In this paper, we present a disambiguation algorithm that attempts to select the correct sense of nouns in textual descriptions of art objects with respect to a rich domain-specific thesaurus, the Art and Architecture Thesaurus (AAT). We performed a series of intrinsic evaluations using a data set of 600 subject terms extracted from an online National Gallery of Art (NGA) collection of images and text. Our results show that using external knowledge sources improves performance over a baseline.