Finding the most unusual time series subsequence: algorithms and applications

In this work we introduce the new problem of finding time series discords. Time series discords are subsequences of a longer time series that are maximally different from all the other subsequences of that time series. They thus capture the sense of the most unusual subsequence within a time series. While discords have many uses in data mining, they are particularly attractive as anomaly detectors because they require only one intuitive parameter (the length of the subsequence), unlike most anomaly detection algorithms, which typically require many parameters. While the brute force algorithm for discovering time series discords is quadratic in the length of the time series, we show a simple algorithm that is three to four orders of magnitude faster than brute force, while guaranteed to produce identical results. We evaluate our work with a comprehensive set of experiments on diverse data sources including electrocardiograms, space telemetry, respiration physiology, anthropological and video datasets.
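
To make the definition concrete, the quadratic brute force baseline mentioned above can be sketched as follows. This is only an illustrative sketch, not the faster algorithm the paper proposes. It assumes Euclidean distance between subsequences and assumes that overlapping subsequences are excluded as trivial matches; neither choice is stated in the abstract, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_discord(series, n):
    """Return (start_index, distance) for the length-n subsequence whose
    nearest non-overlapping neighbour is farthest away.

    Quadratic in the number of subsequences: every candidate is compared
    against every other subsequence.
    """
    m = len(series) - n + 1  # number of length-n subsequences
    best_loc, best_dist = None, -1.0
    for p in range(m):
        nearest = math.inf
        for q in range(m):
            # Assumption: subsequences that overlap the candidate are
            # skipped, since they would match it trivially.
            if abs(p - q) < n:
                continue
            d = euclidean(series[p:p + n], series[q:q + n])
            nearest = min(nearest, d)
        # Keep the candidate with the largest nearest-neighbour distance.
        if nearest != math.inf and nearest > best_dist:
            best_loc, best_dist = p, nearest
    return best_loc, best_dist


if __name__ == "__main__":
    # A periodic signal with one corrupted sample at index 8.
    data = [0, 1] * 10
    data[8] = 5
    print(brute_force_discord(data, 4))
```

The only parameter is the subsequence length `n`, which matches the single-parameter claim in the abstract. The returned subsequence in the usage example overlaps the corrupted sample, since every other subsequence has an exact match elsewhere in the signal.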
