An evolutionary algorithm for clustering data streams with a variable number of clusters

An evolutionary algorithm for clustering data streams is proposed. The algorithm estimates the number of clusters, k, automatically from the data in an online fashion. It monitors possible degradation in the quality of the induced clusters. Results show that the algorithm correctly detects, and reacts to, changes in a data stream. The proposed method is competitive with related algorithms in terms of accuracy and processing time.

Several k-Means-based algorithms for clustering data streams have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty of choosing k, data stream clustering poses further challenges, such as handling non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that estimates k automatically from the data in an online fashion. FEAC-Stream uses the Page-Hinkley Test to detect possible degradation in the quality of the induced clusters; a detected degradation triggers an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms: CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk and StreamKM++-BkM. The results show that FEAC-Stream provides good data partitions and that it can detect, and react accordingly to, changes in the data.
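
To make the monitoring step concrete, the following is a minimal sketch of a Page-Hinkley Test applied to a stream of cluster-quality scores. It is not the FEAC-Stream implementation: the class and function names, the choice of monitoring 1 - quality (so that a drop in quality appears as an increase in the monitored signal), and the values of delta and threshold are all illustrative assumptions, since the abstract does not specify them.

```python
class PageHinkley:
    """Page-Hinkley Test for an increase in the mean of a signal.

    Hypothetical helper: delta (tolerated drift) and threshold
    (alarm level) are illustrative defaults, not values from the paper.
    """

    def __init__(self, delta=0.005, threshold=1.0):
        self.delta = delta
        self.threshold = threshold
        self.reset()

    def reset(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean of the signal
        self.cum = 0.0      # cumulative deviation from the running mean
        self.min_cum = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Consume one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(quality_scores, delta=0.005, threshold=1.0):
    """Return the index of the first alarm, or None if no alarm is raised.

    Assumes quality scores lie in [0, 1], higher meaning better clusters.
    """
    detector = PageHinkley(delta, threshold)
    for i, quality in enumerate(quality_scores):
        # Monitor the loss 1 - quality so that degradation is an increase.
        if detector.update(1.0 - quality):
            return i
    return None


# Quality is stable for 50 steps, then drops abruptly.
stream = [0.8] * 50 + [0.3] * 50
alarm = first_alarm(stream)
print("first alarm at index:", alarm)
```

In a pipeline of the kind the abstract describes, an alarm like this would be the point at which k is re-estimated; after reacting, the detector would typically be reset so that monitoring restarts from the new partition.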
