Streaming Data Analysis: Clustering or Classification?

This article is a position paper about models and algorithms that are generally called “stream clustering.” The semantics and methods used in this field are often co-opted from static clustering, but they do not serve streaming data analysis well. Most “state-of-the-art” methods, such as sequential k-means, BIRCH, CluStream, and DenStream, acknowledge that in real streaming analysis (e.g., intrusion detection or voter fraud) the data are seen only once. Interpretation of their outputs, however, generally overlooks the fact that when the data cannot be saved, batch clustering ideas such as preclustering assessment, partitioning, and cluster validity are not relevant. Yet in the current literature the data, or some subset of them, are often saved for hindsight evaluation (we call this fake stream clustering). Our position? Useful analysis of real streaming data is in its infancy. We do not argue that current approaches to streaming clustering are wrong; rather, we regard them as transitional methods that will eventually lead to a new and useful paradigm for this type of computation. We think that the models and algorithms in this class are actually classifiers, but with a special added component, viz., continuously updated cluster footprints of the in-stream processing. We need to define the objectives of streaming analysis carefully, and then choose terminology and methods that suit this evolving paradigm.
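
To make the “classifier plus footprint” reading concrete, here is a minimal Python sketch in the spirit of sequential k-means. It is an illustration under stated assumptions, not the formulation used by any of the cited methods: the function name `sequential_kmeans`, the choice to seed the prototypes with the first k points, and the running-mean update are all choices made for this example. Each arriving point is labeled by its nearest prototype (the classification step), the winning prototype and its count are updated (the footprint), and the point itself is then discarded.

```python
from typing import Iterable, List, Sequence, Tuple


def sequential_kmeans(
    stream: Iterable[Sequence[float]], k: int
) -> Tuple[List[List[float]], List[int], List[int]]:
    """Single-pass k-means sketch: every point is seen once and never stored."""
    centroids: List[List[float]] = []  # footprint: one prototype per cluster
    counts: List[int] = []             # footprint: points absorbed per prototype
    labels: List[int] = []             # kept only so the example can be inspected

    for point in stream:
        x = [float(v) for v in point]
        if len(centroids) < k:
            # Assumption: the first k points seed the k prototypes.
            centroids.append(x)
            counts.append(1)
            labels.append(len(centroids) - 1)
            continue

        # Classification step: label the point by its nearest prototype.
        j = min(
            range(k),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(x, centroids[i])),
        )

        # Footprint update: running mean of the points assigned to prototype j.
        counts[j] += 1
        centroids[j] = [c + (a - c) / counts[j] for a, c in zip(x, centroids[j])]
        labels.append(j)

    return centroids, counts, labels


if __name__ == "__main__":
    # A generator can be consumed only once, mimicking a real stream.
    demo = (p for p in [(0, 0), (10, 10), (0, 2), (10, 12), (2, 0)])
    print(sequential_kmeans(demo, k=2))
```

Nothing in this loop can revisit an earlier point, so there is no partition of a stored data set to validate afterwards; what survives the stream is the sequence of labels issued and the current footprints.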
