WCDS: A Two-Phase Weightless Neural System for Data Stream Clustering

Clustering is a powerful and versatile tool for knowledge discovery, able to provide valuable information for data analysis in various domains. Performing this task on streaming data is challenging: outdated knowledge needs to be discarded while current knowledge is obtained from fresh data, and, since data flow continuously, strict efficiency constraints have to be met. This paper presents WCDS, an approach to this problem based on the WiSARD artificial neural network model. This model already has useful characteristics, such as an inherent incremental learning capability and evident processing speed. These were combined with novel features: an adaptive countermeasure to cluster imbalance, a mechanism to discard expired data, and offline clustering based on a pairwise similarity measure for WiSARD discriminators. In the experimental evaluation, the proposed system performed well according to multiple quality measures, which supports its applicability to the analysis of data streams.
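To make the terms "WiSARD discriminator" and "pairwise similarity measure" concrete, the following is a minimal sketch of a generic WiSARD-style discriminator together with one possible similarity between two discriminators. It is not the WCDS implementation: the class name `Discriminator`, the function `similarity`, the parameters `input_bits`, `tuple_size` and `seed`, and the Jaccard-style overlap used as the similarity are all assumptions made for illustration, and the abstract does not specify how WCDS defines any of them.

```python
import random
from typing import List, Sequence, Set, Tuple


class Discriminator:
    """Generic WiSARD-style discriminator (illustrative, not WCDS itself).

    The binary input is split into fixed-size tuples of bits; each tuple
    addresses one RAM node, which remembers the addresses seen in training.
    """

    def __init__(self, input_bits: int, tuple_size: int, seed: int = 0) -> None:
        if input_bits % tuple_size != 0:
            raise ValueError("input_bits must be divisible by tuple_size")
        order = list(range(input_bits))
        random.Random(seed).shuffle(order)  # fixed pseudo-random bit mapping
        self.mapping: List[List[int]] = [
            order[i:i + tuple_size] for i in range(0, input_bits, tuple_size)
        ]
        self.nodes: List[Set[Tuple[int, ...]]] = [set() for _ in self.mapping]

    def _addresses(self, bits: Sequence[int]) -> List[Tuple[int, ...]]:
        return [tuple(bits[i] for i in idx) for idx in self.mapping]

    def train(self, bits: Sequence[int]) -> None:
        # Incremental learning: each node records the address it was shown.
        for node, addr in zip(self.nodes, self._addresses(bits)):
            node.add(addr)

    def match(self, bits: Sequence[int]) -> float:
        # Fraction of RAM nodes that recognise their part of the input.
        hits = sum(a in n for n, a in zip(self.nodes, self._addresses(bits)))
        return hits / len(self.nodes)


def similarity(a: Discriminator, b: Discriminator) -> float:
    """Assumed Jaccard-style overlap of stored addresses, node by node.

    Only meaningful when both discriminators share the same bit mapping
    (here: the same input_bits, tuple_size and seed).
    """
    inter = sum(len(x & y) for x, y in zip(a.nodes, b.nodes))
    union = sum(len(x | y) for x, y in zip(a.nodes, b.nodes))
    return inter / union if union else 0.0
```

In this sketch, training is a single pass of set insertions, which illustrates why the model is naturally incremental; a similarity of this kind could then feed a standard offline clustering step over the discriminators.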
