hLDA-based text clustering

The Latent Dirichlet Allocation (LDA) topic model has been applied in many areas in recent years. However, LDA does not cope well with changes in the data set, in part because its number of topics must be fixed in advance, and this limits its applicability. Hierarchical Latent Dirichlet Allocation (hLDA) is a generalization of LDA that adapts automatically as the data set grows. hLDA mines latent topics from large amounts of discrete data and organizes them into a hierarchy in which higher-level topics are more abstract and lower-level topics are more specific. Such a hierarchy yields a deeper semantic model that resembles the way humans organize concepts. Given a set of documents, hLDA uses the nested Chinese restaurant process (nCRP) [1] as a Bayesian nonparametric prior over the topic hierarchy. Documents that share similar topics are assigned to the same path through the hierarchy, and each path forms a cluster. hLDA learns the topic distributions by Bayesian posterior inference. This paper studies the hLDA model in detail and applies it to Chinese text clustering. Experiments show that hLDA is a very promising model for text clustering.
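To make the path-based clustering idea concrete, the following is a minimal Python sketch of drawing document paths from an nCRP prior and grouping documents by the path they share. It illustrates the prior only; it does not perform the Bayesian posterior inference that hLDA uses to fit topics to text, and it is not the implementation used in this paper. The function names `sample_ncrp_paths` and `cluster_by_path`, the fixed tree depth, and the concentration parameter `gamma` are assumptions introduced for illustration.

```python
import random
from collections import defaultdict


def sample_ncrp_paths(num_docs, depth, gamma, seed=0):
    """Draw one root-to-leaf path per document from a nested CRP prior.

    Hypothetical helper for illustration. Nodes are tuples of child
    indices; the root is the empty tuple. `gamma` (assumed parameter)
    controls how readily a document opens a new branch.
    """
    rng = random.Random(seed)
    # children[node] maps child index -> number of documents routed through it
    children = defaultdict(dict)
    paths = []
    for _doc in range(num_docs):
        node = ()
        path = [node]
        for _level in range(depth - 1):
            counts = children[node]
            total = sum(counts.values()) + gamma
            r = rng.random() * total
            chosen = None
            # Existing child k is chosen with probability n_k / (n + gamma)
            for child, n in counts.items():
                r -= n
                if r < 0:
                    chosen = child
                    break
            # Otherwise open a new child, with probability gamma / (n + gamma)
            if chosen is None:
                chosen = len(counts)
            counts[chosen] = counts.get(chosen, 0) + 1
            node = node + (chosen,)
            path.append(node)
        paths.append(tuple(path))
    return paths


def cluster_by_path(paths):
    """Group document indices by leaf node: one cluster per distinct path."""
    clusters = defaultdict(list)
    for doc_id, path in enumerate(paths):
        clusters[path[-1]].append(doc_id)
    return dict(clusters)


if __name__ == "__main__":
    paths = sample_ncrp_paths(num_docs=20, depth=3, gamma=1.0, seed=1)
    for leaf, docs in sorted(cluster_by_path(paths).items()):
        print(leaf, docs)
```

Because the tree starts empty, the first document always opens a new branch at every level. Later documents tend to follow branches that are already popular, which is why the number of paths, and hence clusters, can grow with the data set instead of being fixed in advance.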