In this paper we address the task of identifying topics in large text collections in a completely unsupervised way. In contrast to probabilistic topic modeling methods, which require estimating probability distributions over terms and documents, we model topics as subsets of terms that are used as queries against an index of documents. By retrieving the documents relevant to these topical queries we obtain overlapping clusters of semantically similar documents. To find the topical queries, we generate candidates using signature-calculation heuristics, such as those used in near-duplicate-detection methods, and then evaluate the candidates with an information-gain function we call "semantic force". The method targets the semantic analysis of collections on the order of millions of documents, and has therefore been implemented in map-reduce style. We present initial results that support the feasibility of the approach.
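To make the pipeline concrete, below is a minimal single-machine sketch in Python of the three steps the abstract outlines: generating candidate topical queries with a SpotSigs-style signature heuristic, retrieving their result sets from an inverted index, and ranking them by an information-gain score. The scoring function used here (a smoothed log-lift of observed versus expected conjunctive matches) is a hypothetical stand-in, since the exact definition of "semantic force" is not given in the abstract; the map-reduce formulation is likewise omitted, and all names in the sketch are illustrative.

import math
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "for", "we"}

def spot_signatures(text, chain_len=2):
    # SpotSigs-style signatures: short chains of content words
    # anchored immediately after a stopword occurrence.
    tokens = re.findall(r"[a-z]+", text.lower())
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in STOPWORDS:
            chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.add(tuple(chain))
    return sigs

def build_index(docs):
    # Inverted index: term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS:
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    # Documents matching every term of a candidate topical query (conjunctive).
    sets = [index[t] for t in query]
    return set.intersection(*sets) if sets else set()

def semantic_force(index, query, n_docs):
    # Hypothetical information-gain score (an assumption, not the paper's
    # formula): rewards queries whose result set is larger than
    # term-independence would predict.
    hits = retrieve(index, query)
    if not hits:
        return 0.0
    expected = float(n_docs)
    for t in query:
        expected *= len(index[t]) / n_docs
    return len(hits) * math.log((len(hits) + 1) / (expected + 1))

docs = [
    "the latent semantic analysis of large text collections",
    "latent dirichlet allocation for topic modeling of text",
    "near duplicate detection in large web collections",
]
index = build_index(docs)
candidates = set()
for d in docs:
    candidates |= spot_signatures(d)
ranked = sorted(candidates, key=lambda q: semantic_force(index, q, len(docs)), reverse=True)
for q in ranked[:3]:
    print(q, "->", sorted(retrieve(index, q)))

In the setting the abstract describes, each of these steps would run as a map-reduce job over millions of documents; the in-memory conjunctive retrieval above merely stands in for querying a full document index.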