Soft Subspace Clustering Algorithm for Streaming Data

A key challenge for most conventional clustering algorithms in real-world problems is that data points in different clusters are often correlated with different subsets of features. Subspace clustering, which addresses this problem, has attracted increasing attention in recent years. However, existing subspace clustering methods cannot be applied effectively to large-scale high-dimensional data or to data streams. In this study, the scalable clustering technique is extended to subspace clustering to form soft subspace clustering for streaming data. An entropy-weighting streaming subspace clustering algorithm, EWSSC, is proposed. The method leverages the effectiveness of the fuzzy scalable clustering method for streaming data and reveals the important local subspace characteristics of high-dimensional data. Extensive experiments on both artificial and real-world datasets demonstrate that EWSSC is generally effective in clustering high-dimensional streaming data.
