Clustering, the task of grouping similar data points, is a well-studied problem. Most traditional clustering algorithms assume that the data are free of measurement error. Real-world data sets, however, often contain such errors, and estimates of those errors are frequently available. We present a clustering method that incorporates the information contained in these error estimates. The method rests on a new distance function derived from the distribution of errors in the data. Under a Gaussian model for the errors, the distance function follows a chi-square distribution and is easy to compute. We use this distance function in hierarchical clustering to discover meaningful clusters. The distance function is scale-invariant, so clustering results do not depend on the units in which the data are measured. In the special case where the error distribution is the same for each attribute of the data points, our distance function and the Euclidean distance function yield the same rank order of pairwise distances. We apply the clustering method to the seasonality estimation problem and present experimental results on retail industry data and on simulated data, where it outperforms classical clustering methods.
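The two properties claimed for the distance function, scale invariance and agreement with Euclidean rank order under equal errors, can be illustrated with a minimal sketch. The abstract does not state the formula, so the form used here is an assumption: independent Gaussian errors with known per-attribute variances, and a distance equal to the sum over attributes of the squared difference divided by the sum of the two error variances. The function names, sample points, and variances are all hypothetical.

```python
from itertools import combinations
from math import isclose, sqrt


def error_distance(x, var_x, y, var_y):
    """Assumed form: each squared attribute difference is divided by the
    variance of that difference (sum of the two error variances)."""
    return sum((a - b) ** 2 / (va + vb)
               for a, b, va, vb in zip(x, y, var_x, var_y))


def euclidean(x, y):
    """Ordinary Euclidean distance, for comparison."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rescale(values, factors, power=1):
    """Multiply each attribute by its unit-conversion factor raised to `power`."""
    return [v * f ** power for v, f in zip(values, factors)]


# Hypothetical data: three points with two attributes each, plus the
# estimated error variance of every attribute of every point.
points = [[1.0, 10.0], [2.0, 14.0], [1.5, 30.0]]
variances = [[0.1, 4.0], [0.2, 1.0], [0.1, 9.0]]

d01 = error_distance(points[0], variances[0], points[1], variances[1])

# Scale invariance: changing the units of an attribute by a factor c
# multiplies its values by c and its error variance by c**2.
factors = [100.0, 0.001]
scaled_points = [rescale(p, factors) for p in points]
scaled_vars = [rescale(v, factors, power=2) for v in variances]
d01_scaled = error_distance(scaled_points[0], scaled_vars[0],
                            scaled_points[1], scaled_vars[1])
scale_invariant = isclose(d01, d01_scaled)

# Equal errors everywhere: the distance reduces to a constant multiple of the
# squared Euclidean distance, so pairs are ranked in the same order.
equal_vars = [[0.5, 0.5] for _ in points]
pairs = list(combinations(range(len(points)), 2))
rank_by_error = sorted(
    pairs,
    key=lambda p: error_distance(points[p[0]], equal_vars[p[0]],
                                 points[p[1]], equal_vars[p[1]]))
rank_by_euclid = sorted(
    pairs, key=lambda p: euclidean(points[p[0]], points[p[1]]))
same_rank_order = rank_by_error == rank_by_euclid
```

In this sketch the equal-error case is read as a single variance shared by all attributes of all points, which is one interpretation of the special case described above. A matrix of such pairwise distances could then be passed to any standard agglomerative clustering routine that accepts precomputed distances.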