The popularity of blogs has increased dramatically over the last couple of years. As topics evolve in the blogosphere, keywords align together and form the heart of various stories. Intuitively, we expect that in certain contexts, when there is a lot of discussion on a specific topic or event, a set of keywords will be correlated: the keywords in the set will frequently appear together (pairwise or in conjunction), forming a cluster. Note that such keyword clusters are temporal (associated with specific time periods) and transient: as topics recede, the associated keyword clusters dissolve because their keywords no longer appear together frequently.
In this paper, we formalize this intuition and present efficient algorithms to identify keyword clusters in large collections of blog posts for specific temporal intervals. We then formalize problems related to the temporal properties of such clusters; in particular, we present efficient algorithms to identify clusters that persist over time. Given the vast amounts of data involved, our algorithms are designed to be fast (they can efficiently process millions of blogs with multiple millions of posts), and we take special care to make them efficiently realizable in secondary storage. Although we instantiate our techniques in the context of blogs, our methodology is generic enough to apply equally well to any temporally ordered text source.
We present the results of an experimental study using both real and synthetic data sets, demonstrating both the performance of our algorithms and the quality of the keyword clusters and associated temporal properties they identify.