Clustering Textual Data by Latent Dirichlet Allocation: Applications and Extensions to Hierarchical Data

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that can be used to describe and analyse textual data. We extend the basic LDA model to search and classify a large set of administrative documents, taking into account the clear hierarchical structure of the textual data. The extension can be regarded as a general approach to the analysis of short texts that are semantically linked to larger texts. We present some preliminary empirical evidence that supports the proposed model.