Probabilistic topic models for information retrieval and concept modeling

Statistical topic models are a class of probabilistic latent variable models for textual data that represent documents as distributions over topics. These models have been shown to produce interpretable summaries of documents in the form of topics. In this dissertation, we investigate how the statistical topic modeling framework can be used for information retrieval tasks and for integrating background knowledge in the form of semantic concepts. We first describe the special-words topic models, in which a document is represented as a combination of (i) a mixture of shared topics, (ii) a special-words distribution specific to the document, and (iii) a corpus-level background distribution. We describe the utility of the special-words topic models for information retrieval tasks and illustrate a variation of the model for metadata enhancement of digital libraries with multiple corpora. We next investigate the problem of integrating background knowledge in the form of semantic concepts into the topic modeling framework. To combine data-driven topics and semantic concepts, we propose the concept-topic model, which represents a document as a distribution over data-driven topics and semantic concepts. We extend this model to the hierarchical concept-topic model to incorporate concept hierarchies into the modeling framework. For all these models, we develop learning algorithms and demonstrate their utility with experiments conducted on real-world data sets.
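
The three-part document representation of the special-words model can be read as a mixture over word sources: each word in a document comes from a shared topic, from the document's own special-words distribution, or from the corpus-level background distribution. The following is a minimal sketch under that reading, not the dissertation's implementation. The function name, the dictionary layout, the per-document routing weights, and all numeric values are hypothetical and chosen only for illustration.

```python
def word_probability(word, route, topic_weights, topic_word, special_word, background_word):
    """Probability of `word` in one document under a three-source mixture.

    route           -- hypothetical per-document weights over the three sources
    topic_weights   -- the document's distribution over shared topics
    topic_word      -- per-topic distributions over words
    special_word    -- the document-specific special-words distribution
    background_word -- the corpus-level background distribution
    """
    # Shared-topic component: mix the topic-word distributions
    # using the document's topic weights.
    shared = sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in topic_weights.items()
    )
    return (
        route["topics"] * shared
        + route["special"] * special_word.get(word, 0.0)
        + route["background"] * background_word.get(word, 0.0)
    )


# Toy values (assumed, not taken from any data set).
route = {"topics": 0.5, "special": 0.25, "background": 0.25}
topic_weights = {"t0": 0.75, "t1": 0.25}
topic_word = {
    "t0": {"model": 0.5, "data": 0.5},
    "t1": {"query": 1.0},
}
special_word = {"smyth": 1.0}
background_word = {"the": 0.5, "of": 0.5}

vocabulary = ["model", "data", "query", "smyth", "the", "of"]
probabilities = {
    w: word_probability(w, route, topic_weights, topic_word, special_word, background_word)
    for w in vocabulary
}
```

In this toy setup a word such as "smyth" that no shared topic explains still receives probability mass through the special-words component, which is the intuition behind using the model for document-specific matching in retrieval.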