Hyper-ellipsoidal clustering technique for evolving data stream

Abstract: Data mining has become a key ingredient in establishing intelligent decision support systems. As one of the main branches of data mining, data stream clustering has received much attention over the past decade. Most existing data stream clustering techniques rely on the Euclidean distance metric to find similar objects and hence produce spherical clusters, which are not always suitable for representing the data. Moreover, most real-world problems involve data of varying density, which density-based clustering techniques cannot handle. In this paper, we introduce a new clustering technique called Hyper-Ellipsoidal Clustering for Evolving data Stream (HECES), based on the recently proposed HyCARCE algorithm. HECES modifies the HyCARCE algorithm in three ways to handle the stream clustering problem: a sliding window model is used to process the incoming stream of data and minimize the impact of obsolete information on recent clustering results; a shrinkage technique is used to avoid the singularity issue in estimating the covariance of correlated data; and a novel technique for merging the initial ellipsoids is used to obtain the final clusters, replacing a computationally intensive process of expansion and adjustment. HECES relies on the Mahalanobis distance metric to cluster the data points and hence produces ellipsoidal clusters. It can successfully handle data of varying density. Experiments on various synthetic and real datasets for clustering streaming data provide a comparative validation of our approach.
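The three ingredients named in the abstract (a sliding window, a shrunk covariance estimate, and the Mahalanobis distance) can be illustrated with a minimal Python sketch. This is not the HECES implementation: the function names, the fixed shrinkage weight `lam`, and the scaled-identity shrinkage target are all assumptions chosen for illustration, and the abstract does not say which shrinkage estimator the paper uses.

```python
from collections import deque

import numpy as np


def shrunk_covariance(points, lam=0.1):
    """Blend the sample covariance with a scaled identity target.

    Assumption: fixed weight `lam` and target (trace(S) / d) * I.
    """
    x = np.asarray(points, dtype=float)
    d = x.shape[1]
    s = np.atleast_2d(np.cov(x, rowvar=False))
    target = (np.trace(s) / d) * np.eye(d)
    return (1.0 - lam) * s + lam * target


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of `point` from `mean` under `cov`."""
    diff = np.asarray(point, dtype=float) - mean
    return float(diff @ np.linalg.solve(cov, diff))


# Sliding window: only the 4 most recent points are kept,
# so the obsolete point (9, 9) is dropped automatically.
window = deque(maxlen=4)
for p in [(9.0, 9.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]:
    window.append(p)

pts = np.array(window)
mean = pts.mean(axis=0)

# The windowed points are perfectly correlated, so the plain sample
# covariance is singular; shrinkage makes it invertible.
cov = shrunk_covariance(pts, lam=0.1)

d_along = mahalanobis_sq((2.5, 2.5), mean, cov)   # along the data direction
d_across = mahalanobis_sq((2.5, 0.5), mean, cov)  # across the data direction
```

Both query points lie at the same Euclidean distance from the window mean, yet the point across the data direction receives a much larger Mahalanobis distance. This is the behavior that lets a Mahalanobis-based method describe elongated, ellipsoidal clusters where a Euclidean one would assume spheres.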
