Passive Approach for the K-means Problem on Streaming Data

The amount of data produced worldwide is growing rapidly, so large volumes of unlabeled data must be processed continuously. Clustering is one of the main unsupervised data analysis tasks. In streaming scenarios, the data consist of a growing sequence of batches of samples, in which concept drift may occur. In this paper, we formally define the Streaming $K$-means (S$K$M) problem, which implies a restart of the error function whenever a concept drift occurs. We propose a surrogate error function that does not rely on concept drift detection, and we prove that this surrogate is a good approximation of the S$K$M error. Accordingly, we propose an algorithm that minimizes this surrogate error each time a new batch arrives. We also present several initialization techniques for streaming data scenarios. In addition to the theoretical results, experiments show that the non-trivial initialization methods improve the converged error.
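
The abstract does not state the form of the surrogate error, so the following is only an illustrative sketch under an assumption: older batches are down-weighted by an exponential forgetting factor, so no drift detector is needed, and a weighted Lloyd iteration is rerun whenever a new batch arrives. The names `weighted_kmeans_error`, `weighted_lloyd` and the parameter `rho` are hypothetical and are not taken from the paper.

```python
import numpy as np

def batch_weights(batches, rho):
    # ASSUMPTION: a batch of age a (0 = newest) gets weight rho**a, 0 < rho <= 1.
    newest = len(batches) - 1
    return np.concatenate(
        [np.full(len(b), rho ** (newest - t)) for t, b in enumerate(batches)]
    )

def weighted_kmeans_error(batches, centroids, rho):
    """Weighted mean squared distance of each sample to its nearest centroid."""
    X = np.vstack(batches)
    w = batch_weights(batches, rho)
    # Squared distances, shape (n_samples, n_centroids).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((w * d2.min(axis=1)).sum() / w.sum())

def weighted_lloyd(batches, centroids, rho, n_iter=20):
    """Lloyd-style iterations on the assumed weighted error."""
    X = np.vstack(batches)
    w = batch_weights(batches, rho)
    centroids = np.array(centroids, dtype=float)  # work on a float copy
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(len(centroids)):
            mask = labels == k
            if mask.any():  # an empty cluster keeps its previous centroid
                centroids[k] = np.average(X[mask], axis=0, weights=w[mask])
    return centroids

# Two small batches; the second one is the most recent.
batches = [np.array([[0.0, 0.0], [0.0, 1.0]]),
           np.array([[10.0, 0.0], [10.0, 1.0]])]
init = np.array([[0.0, 0.0], [10.0, 0.0]])
fitted = weighted_lloyd(batches, init, rho=0.5)
print(fitted, weighted_kmeans_error(batches, fitted, rho=0.5))
```

In a streaming loop, one would append each incoming batch to `batches` and call `weighted_lloyd` again, warm-started from the previous centroids or from one of the initialization strategies the paper studies.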
