Mining underlying correlated-clusters in high-dimensional data streams

High-dimensional data streams pose challenges to traditional clustering algorithms because of their inherent sparsity and because the data tend to cluster in different subspaces of the full feature space. In this paper, we address the subspace clustering problem by mining correlated-clusters, in which the selected features are correlated with each other. Moreover, to account for data evolution in streams, we propose methods that mine feature correlations incrementally and adaptively. At each time tick t, we use our proposed multiple regression measurement to assign the newly arrived data sample to the correlated-cluster whose local correlations fit that sample, and we then update those local correlations adaptively using an incremental principal component analysis technique. Experiments on high-dimensional synthetic and real data demonstrate that our methods achieve higher query accuracy than related work and run much more efficiently. Additionally, our methods can successfully forecast missing values in streaming data.
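The per-tick procedure described above (assign the new sample to the best-fitting correlated-cluster, then update that cluster's local correlations incrementally) can be illustrated with a minimal sketch. It is not the paper's implementation, and several simplifications are assumed: each cluster keeps only a running mean and one principal direction; the direction is updated with a CCIPCA-style rule without an amnesic parameter; the fit test is the orthogonal residual to that direction, standing in for the multiple regression measurement, which the abstract does not specify; and the names `CorrelatedCluster`, `assign_and_update`, `fill_missing`, and `threshold` are hypothetical.

```python
import math

# Illustrative sketch only. All names, the single-direction model, the
# residual-based fit test and the threshold are assumptions for this example;
# they are not the paper's multiple regression measurement.

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class CorrelatedCluster:
    """Running mean plus an estimate of the first principal direction."""

    def __init__(self, sample):
        self.n = 1
        self.mean = list(sample)
        self.v = None  # unnormalised direction estimate

    def direction(self):
        """Unit direction, or None until a sample differs from the mean."""
        if self.v is None:
            return None
        norm = math.sqrt(_dot(self.v, self.v))
        return [c / norm for c in self.v]

    def residual(self, x):
        """Distance from x to the line through the mean along the direction."""
        u = [xi - mi for xi, mi in zip(x, self.mean)]
        d = self.direction()
        if d is None:
            return math.sqrt(_dot(u, u))
        proj = _dot(u, d)
        return math.sqrt(max(0.0, _dot(u, u) - proj * proj))

    def update(self, x):
        """Incremental mean update followed by a CCIPCA-style direction update."""
        self.n += 1
        self.mean = [m + (xi - m) / self.n for m, xi in zip(self.mean, x)]
        u = [xi - mi for xi, mi in zip(x, self.mean)]
        if self.v is None:
            if any(u):
                self.v = u
            return
        proj = _dot(u, self.direction())
        w = 1.0 / self.n
        self.v = [(1 - w) * vi + w * proj * ui for vi, ui in zip(self.v, u)]

def assign_and_update(clusters, x, threshold):
    """Assign x to the best-fitting cluster, or open a new one; return its index."""
    best = min(range(len(clusters)),
               key=lambda i: clusters[i].residual(x), default=None)
    if best is None or clusters[best].residual(x) > threshold:
        clusters.append(CorrelatedCluster(x))
        return len(clusters) - 1
    clusters[best].update(x)
    return best

def fill_missing(cluster, x):
    """Estimate entries of x that are None from the cluster's mean and direction."""
    d = cluster.direction()
    known = [i for i, xi in enumerate(x) if xi is not None]
    denom = sum(d[i] ** 2 for i in known) if d is not None else 0.0
    if not denom:
        return [cluster.mean[i] if xi is None else xi for i, xi in enumerate(x)]
    # Least-squares position along the direction, using the known entries only.
    s = sum((x[i] - cluster.mean[i]) * d[i] for i in known) / denom
    return [cluster.mean[i] + s * d[i] if xi is None else xi
            for i, xi in enumerate(x)]
```

Calling `assign_and_update(clusters, x, threshold)` once per arriving sample, starting from an empty `clusters` list, mirrors the per-tick loop. `fill_missing` shows one way a cluster's local correlation could be used to estimate a missing value. A fuller version would track several principal components per cluster and let old samples decay so the clusters can follow evolving data.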
