Expanding the taxonomies of bibliographic archives with persistent long-term themes

As document collections accumulate over time, some of the subjects discussed in them fall out of fashion, while new ones emerge. In this paper, we address the challenge of finding such emerging and persistent "themes", i.e. subjects that live long enough to be incorporated into a taxonomy or ontology describing the document collection. Our method combines similarity-based clustering with cluster label construction and focusses on identifying cluster labels that "survive" changes in the constitution of the underlying document population, including changes in the feature space of dominant words. We conducted a set of experiments, with promising results, on identifying themes that manifested themselves in the ACM library within the last decade.
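
To make the idea of "surviving" cluster labels concrete, the following is a minimal sketch and not the method used in the paper. It assumes that the collection is split into time periods, that documents are represented by raw term counts, that clustering is a greedy single-pass procedure over cosine similarity, that a cluster label is the set of its k most frequent words, and that a label survives if every later period contains a label with sufficient Jaccard overlap. All function names, thresholds and sample documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts (assumption: no stemming or stop-word removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: a document joins the most similar
    cluster (summed term counts) or starts a new one."""
    clusters = []
    for text in docs:
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c), default=None)
        if best is not None and cosine(vec, best) >= threshold:
            best.update(vec)
        else:
            clusters.append(Counter(vec))
    return clusters


def label(cluster_vec, k=2):
    """Hypothetical labelling rule: the k most frequent words of the cluster."""
    return frozenset(w for w, _ in cluster_vec.most_common(k))


def jaccard(a, b):
    """Overlap between two labels, treated as word sets."""
    return len(a & b) / len(a | b)


def persistent_labels(periods, min_overlap=0.5):
    """Labels from the first period that are matched by some label
    in every later period."""
    labels = [[label(c) for c in cluster(docs)] for docs in periods]
    return [
        first
        for first in labels[0]
        if all(any(jaccard(first, other) >= min_overlap for other in later)
               for later in labels[1:])
    ]


# Toy collection with two time periods (invented sample data).
periods = [
    ["graph clustering graph", "clustering graph methods", "punch cards punch"],
    ["graph clustering streams", "graph clustering graph", "web services web"],
]
print([sorted(l) for l in persistent_labels(periods)])  # [['clustering', 'graph']]
```

In this toy run, the label built from "punch" and "cards" disappears after the first period and the label built from "web" and "services" appears only in the second, so only the "graph"/"clustering" label counts as a persistent theme. Comparing labels rather than document vectors is what lets a theme be tracked even when the surrounding vocabulary shifts between periods.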