Constraint-based discriminative dimension selection for high-dimensional stream clustering

Clustering data streams is an active research topic in data mining. However, the runtime of existing stream clustering algorithms increases and their clustering quality drops as the number of dimensions grows. One way to reduce this complexity is to determine an appropriate subset of dimensions for each cluster via dimension projection. SED-Stream is an efficient clustering algorithm that supports high-dimensional data streams. The aim of this paper is to improve the performance of SED-Stream in terms of both clustering quality and execution time. To improve the clustering process, background or domain-expert knowledge is integrated as “constraints” in a new algorithm, SEDC-Stream. SEDC-Stream supports the evolving characteristics of dynamic constraints, namely activation, fading, outdating and prioritization. It is able to reduce cluster splitting time and to place newly arriving points into suitable clusters. Compared with SED-Stream on three real-world stream datasets, SEDC-Stream achieves better clustering performance in terms of both purity and F-measure.
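For readers unfamiliar with the purity measure mentioned above, the following is a minimal sketch of one common definition: each cluster is credited with its majority class, and the credited points are divided by the total number of points. The function name `purity` and the toy data are illustrative assumptions; the paper may use a variant (for example, purity averaged per cluster or computed over a stream horizon).

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of points that belong to the majority class of their cluster.

    This is one common definition of purity, assumed here for illustration;
    it is not necessarily the exact formula used in the paper.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one point is required")

    # Count how many points of each class fall into each cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster[cluster][label] += 1

    # Credit each cluster with the size of its majority class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example (made-up data): two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
score = purity(clusters, labels)
```

In the toy example, each cluster contains two points of its majority class and one point of the other class, so four of the six points are credited.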
