Cluster Data Streams with Noisy Variables

Clustering algorithms are widely used in data stream mining because they can handle unbounded data flows. Although these algorithms are effective at uncovering latent relationships in data streams, most of them lose cluster purity and become unstable when the input stream contains many noisy variables. In this article, we propose an algorithm for clustering data streams with noisy variables. It adds a variable selection step as a component of the clustering algorithm, and simulation results show that it outperforms methods from previous studies. Two experiments indicate that clustering data streams with variable selection is more stable and yields higher purity than clustering without it. A further experiment on the KDD-CUP99 dataset also shows that our algorithm generates more stable results.
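The abstract reports cluster purity without defining it. As a minimal sketch, assuming the standard definition (the fraction of points that fall in the majority class of their assigned cluster), purity could be computed as follows. The function name `cluster_purity` and the toy data are illustrative assumptions, not taken from the article.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_labels):
    """Return the fraction of points in the majority class of their cluster.

    Assumes the standard purity definition; the article may use a variant.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each true class appears inside each cluster.
    per_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1

    # Sum the size of the majority class over all clusters.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: cluster 0 holds two "a" and one "b", cluster 1 holds three "b".
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "b"]
print(cluster_purity(clusters, labels))
```

A purity of 1.0 means every cluster contains a single class. Noisy variables tend to pull points into the wrong cluster and lower this value, which is the effect the abstract describes.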
