The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream.
The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume.
This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component, which periodically stores detailed summary statistics, and an offline component, which uses only these summary statistics. The offline component is utilized by the analyst, who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream. The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a microclustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.
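The online/offline split can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes each summary holds a count, a per-dimension sum, and a per-dimension sum of squares, that the set of summaries is fixed by hypothetical seed points, and that snapshots are stored at a fixed interval rather than on the pyramidal time frame. The names `Summary`, `OnlineSummarizer`, and `horizon_summaries` are illustrative only. Because such statistics are additive, the offline step can recover a summary of a recent time horizon by subtracting an older snapshot from a newer one, without revisiting the stream.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Summary:
    """Additive statistics of a group of points (assumed layout)."""
    n: int            # number of points absorbed
    ls: List[float]   # per-dimension linear sum
    ss: List[float]   # per-dimension sum of squares

    def add(self, x: Sequence[float]) -> None:
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def minus(self, older: "Summary") -> "Summary":
        # Subtracting an older snapshot leaves only the points seen since then.
        return Summary(self.n - older.n,
                       [a - b for a, b in zip(self.ls, older.ls)],
                       [a - b for a, b in zip(self.ss, older.ss)])

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def copy(self) -> "Summary":
        return Summary(self.n, list(self.ls), list(self.ss))


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


class OnlineSummarizer:
    """Online component: one pass over the stream, periodic snapshots."""

    def __init__(self, seeds: Sequence[Sequence[float]], snapshot_every: int):
        # Assumption: each seed point starts one summary; no merging or deletion.
        self.clusters = [Summary(1, list(s), [v * v for v in s]) for s in seeds]
        self.snapshot_every = snapshot_every
        self.snapshots: Dict[int, List[Summary]] = {}

    def observe(self, t: int, x: Sequence[float]) -> None:
        # Absorb the point into the summary with the nearest centroid.
        nearest = min(self.clusters, key=lambda c: sq_dist(c.centroid(), x))
        nearest.add(x)
        # Simplification: fixed-interval snapshots, not a pyramidal schedule.
        if t % self.snapshot_every == 0:
            self.snapshots[t] = [c.copy() for c in self.clusters]


def horizon_summaries(online: OnlineSummarizer, now: int, horizon: int) -> List[Summary]:
    """Offline component: summaries of the last `horizon` time units only.

    Both `now` and `now - horizon` must be times at which a snapshot was stored.
    """
    current = online.snapshots[now]
    past = online.snapshots[now - horizon]
    return [c.minus(p) for c, p in zip(current, past)]
```

In this sketch an analyst would pick `now` and `horizon` after the fact and then run any conventional clustering method over the returned summaries. The fixed snapshot interval limits which horizons can be queried, which is one reason the choice and storage of snapshots matters for a fast stream.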