Finding Unusual Medical Time-Series Subsequences: Algorithms and Applications

In this work, we introduce the new problem of finding time series discords. Time series discords are subsequences of a longer time series that are maximally different from all the other subsequences of that time series. They thus capture the notion of the most unusual subsequence within a time series. While discords have many uses in data mining, they are particularly attractive as anomaly detectors because they require only one intuitive parameter (the length of the subsequence), whereas most anomaly detection algorithms require many parameters. Although the brute-force algorithm for discovering time series discords is quadratic in the length of the time series, we show a simple algorithm that is three to four orders of magnitude faster than brute force while guaranteed to produce identical results. We evaluate our work with a comprehensive set of experiments on electrocardiograms and other medical datasets.
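
To make the quadratic brute-force baseline concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common reading of "maximally different": the discord is the subsequence whose nearest non-overlapping neighbor is farthest away. The abstract does not name a distance measure, so the use of Euclidean distance is also an assumption, as are the function name `brute_force_discord` and its parameters.

```python
import math


def brute_force_discord(series, n):
    """Return (start_index, distance) of the length-n subsequence whose
    nearest non-overlapping neighbor is farthest away.

    Assumptions (not stated in the abstract): Euclidean distance, and a
    candidate is only compared with subsequences that do not overlap it.
    """
    m = len(series) - n + 1  # number of length-n subsequences
    if n < 1 or m < 1:
        raise ValueError("n must be between 1 and len(series)")

    best_loc, best_dist = None, -1.0
    for p in range(m):  # outer loop: every candidate subsequence
        nearest = math.inf
        for q in range(m):  # inner loop: every possible neighbor
            if abs(p - q) < n:
                continue  # skip overlapping (trivial) matches
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(series[p:p + n], series[q:q + n])))
            if d < nearest:
                nearest = d
        # Keep the candidate with the largest nearest-neighbor distance.
        if nearest != math.inf and nearest > best_dist:
            best_loc, best_dist = p, nearest
    return best_loc, best_dist


# Toy usage: a regular 0/1 pattern with a single spike at index 7.
toy = [0, 1, 0, 1, 0, 1, 0, 5, 0, 1, 0, 1, 0, 1, 0, 1]
loc, dist = brute_force_discord(toy, 2)
print(loc, dist)
```

The two nested loops over all subsequences are what make this baseline quadratic in the length of the time series; the subsequence length `n` is the only parameter the caller has to choose.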
