An Information-Theoretic Approach for Unsupervised Topic Mining in Large Text Collections

In this paper we focus on the task of identifying topics in large text collections in a completely unsupervised way. In contrast to probabilistic topic modeling methods, which require estimating probability density functions, we model topics as subsets of terms that are used as queries against an index of documents. By retrieving the documents relevant to these topical queries, we obtain overlapping clusters of semantically similar documents. To find the topical queries, we generate candidate queries using signature-calculation heuristics, such as those used in duplicate-detection methods, and then evaluate the candidates with an information-gain function we call "semantic force". The method targets the semantic analysis of collections on the order of millions of documents, so it has been implemented in MapReduce style. We present initial results that support the feasibility of the approach.
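The abstract does not specify the exact signature scheme or the definition of "semantic force"; the following is a minimal sketch under assumed choices. It uses minhash-style signatures (a common duplicate-detection heuristic) to group documents and derive candidate term queries, and a KL-divergence score between the retrieved subset's term distribution and the whole collection's as a stand-in for the information-gain function. All function names and parameters here are illustrative, not the paper's actual method.

```python
import hashlib
import math
from collections import defaultdict

def minhash_signature(terms, num_hashes=4):
    """For each seeded hash function, keep the term with the smallest
    hash value; the resulting tuple serves as the document signature."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(terms, key=lambda t: int(
            hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)))
    return tuple(sorted(set(sig)))

def candidate_queries(docs, num_hashes=4):
    """Group documents by signature; each signature's term set is a
    candidate topical query (documents sharing it form a candidate
    cluster). docs is a list of term lists."""
    groups = defaultdict(list)
    for i, terms in enumerate(docs):
        groups[minhash_signature(terms, num_hashes)].append(i)
    return groups

def information_gain(query, docs):
    """Toy information-gain proxy for 'semantic force': KL divergence
    between term frequencies in the subset retrieved by the query and
    term frequencies in the whole collection."""
    retrieved = [d for d in docs if set(query) <= set(d)]
    if not retrieved:
        return 0.0
    def dist(ds):
        counts = defaultdict(int)
        for d in ds:
            for t in d:
                counts[t] += 1
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()}
    p, q = dist(retrieved), dist(docs)
    # Every term of the subset also occurs in the collection, so q[t] > 0.
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items())

docs = [
    ["neural", "network", "training", "gradient"],
    ["neural", "network", "layers", "gradient"],
    ["stock", "market", "trading", "price"],
    ["stock", "market", "price", "index"],
]
for sig, members in candidate_queries(docs).items():
    print(sig, members, round(information_gain(sig, docs), 3))
```

In a MapReduce realization of this pipeline, the map phase would emit (signature, document-id) pairs and the reduce phase would aggregate each signature's cluster and score it, which is what makes the approach amenable to collections of millions of documents.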