Clustering of streaming time series is meaningless

Time series data is perhaps the most frequently encountered type of data examined by the data mining community. Clustering is perhaps the most frequently used data mining algorithm, being useful in its own right as an exploratory technique, and also as a subroutine in more complex data mining algorithms such as rule discovery, indexing, summarization, anomaly detection, and classification. Given these two facts, it is hardly surprising that time series clustering has attracted much attention. The data to be clustered can be in one of two formats: many individual time series, or a single time series, from which individual time series are extracted with a sliding window. Given the recent explosion of interest in streaming data and online algorithms, the latter case has received much attention.

In this work we make a surprising claim. Clustering of streaming time series is completely meaningless. More concretely, clusters extracted from streaming time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any dataset, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.

We can justify calling our claim surprising, since it invalidates the contribution of dozens of previously published papers. We will justify our claim with a theorem, illustrative examples, and a comprehensive set of experiments on reimplementations of previous work. Although the primary contribution of our work is to draw attention to the fact that an apparent solution to an important problem is incorrect and should no longer be used, we also introduce a novel method which, based on the concept of time series motifs, is able to meaningfully cluster some streaming time series datasets.
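
For concreteness, the sliding-window format mentioned in the abstract can be sketched as follows. This is only a minimal illustration, not the implementation used in this work: the function name `sliding_windows`, the window length `w`, and the toy series are all assumptions introduced here.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every length-w subsequence of `series`, one per start offset.

    Hypothetical helper for illustration; `w` is an assumed window length.
    """
    if w <= 0 or w > len(series):
        raise ValueError("window length must be between 1 and len(series)")
    # Slide the window one point at a time across the series.
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


# Toy series with made-up values, for illustration only.
series = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
subsequences = sliding_windows(series, w=3)

for start, sub in enumerate(subsequences):
    print(start, sub)
```

A series of length m yields m - w + 1 subsequences, and each subsequence shares w - 1 points with its neighbor. These heavily overlapping subsequences are the objects that would be handed to a clustering algorithm in the streaming setting.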
