Online learning algorithms that address fast data streams should process examples at the rate they arrive, using a single scan of the data and fixed memory, maintaining a decision model at any time and adapting that model to the most recent data. These requirements make approximate models necessary. One problem that usually arises with approximate models is defining the minimum number of observations needed to assure convergence; this carries a high risk, since the system may have to decide based on only a small subset of the entire data. One approach is to apply techniques based on the Hoeffding bound to enforce decisions with a given confidence level. In divisive clustering of time series, the goal is to find clusters of similar time series over time. Online approaches face two decisions: when to split and how to assign variables to the new clusters. A confidence level can be defined for both the split decision and the assignment of variables to new clusters. Previous work has already addressed confident decisions on the moment of split. Our proposal is to add a confidence level to the assignment process. When a split point is reported, creating two new clusters, we directly assign the variables that are confidently closer to one cluster than to the other, and apply a different strategy to the variables that do not satisfy the confidence level. In this paper we propose to assign these unsure variables to a third cluster. Experimental evaluation is presented in the context of a recently proposed hierarchical algorithm; it assesses the advantages of the proposal and also reveals reduced memory usage and faster processing. Although the proposal is evaluated within an existing method, it can be generalized to any divisive procedure.
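The assignment idea described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper. It assumes that each variable has a dissimilarity to each of the two new clusters, that these dissimilarities are bounded within a known range `value_range`, and that they were estimated from `n` observations. The function names, the dictionary layout and the sample values are hypothetical. The bound used is the standard Hoeffding form, epsilon = sqrt(R^2 ln(1/delta) / (2n)).

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound for a variable with the given range,
    confidence 1 - delta, and n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def assign_with_confidence(dist_a, dist_b, value_range, delta, n):
    """Split variables into clusters 'a', 'b' and 'unsure'.

    dist_a / dist_b map each variable name to its (assumed bounded)
    dissimilarity to cluster A and cluster B. A variable is assigned
    directly only when the gap between the two dissimilarities exceeds
    the Hoeffding bound; otherwise it goes to the third cluster.
    """
    eps = hoeffding_bound(value_range, delta, n)
    clusters = {"a": [], "b": [], "unsure": []}
    for name in dist_a:
        gap = dist_b[name] - dist_a[name]
        if gap > eps:
            clusters["a"].append(name)       # confidently closer to A
        elif gap < -eps:
            clusters["b"].append(name)       # confidently closer to B
        else:
            clusters["unsure"].append(name)  # confidence level not met
    return clusters


# Hypothetical dissimilarities for three variables.
dist_a = {"x1": 0.10, "x2": 0.80, "x3": 0.45}
dist_b = {"x1": 0.90, "x2": 0.15, "x3": 0.50}

result = assign_with_confidence(dist_a, dist_b, value_range=1.0, delta=0.05, n=200)
print(result)
```

With these sample values, `x1` and `x2` are assigned directly, while `x3`, whose two dissimilarities differ by less than the bound, falls into the third cluster. Because the bound shrinks as `n` grows, fewer variables remain unsure when more observations are available.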