Paper Information - The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data

This paper presents a novel statistical latent class model for text mining and interactive information access. The described learning architecture, called the Cluster-Abstraction Model (CAM), is purely data-driven and utilizes context-specific word occurrence statistics. In an intertwined fashion, the CAM extracts hierarchical relations between groups of documents as well as an abstractive organization of keywords. An annealed version of the Expectation-Maximization (EM) algorithm for maximum likelihood estimation of the model parameters is derived. The benefits of the CAM for interactive retrieval and automated cluster summarization are investigated experimentally.

Thomas Hofmann
