Disk aware discord discovery: finding unusual time series in terabyte sized datasets

The problem of finding unusual time series has recently attracted much attention, and several promising methods are now in the literature. However, virtually all proposed methods assume that the data reside in main memory. For many real-world problems this is not the case. For example, in astronomy, multi-terabyte time series datasets are the norm. Most current algorithms, when faced with data that cannot fit in main memory, resort to multiple scans of the disk or tape and are thus intractable. In this work we show how one particular definition of unusual time series, the time series discord, can be discovered with a disk-aware algorithm. The proposed algorithm is exact and requires only two linear scans of the disk with a tiny buffer of main memory. Furthermore, it is very simple to implement. We use the algorithm to provide further evidence of the effectiveness of the discord definition in areas as diverse as astronomy, web query mining, and video surveillance, and show the efficiency of our method on datasets which are many orders of magnitude larger than anything else attempted in the literature.
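
To make the two-scan idea concrete, the following is a minimal sketch, not the authors' published procedure. It assumes that a discord is a time series whose distance to its nearest neighbor is at least a user-chosen range `r`, that Euclidean distance is used, and that the dataset can be re-read sequentially through a hypothetical `scan` callable standing in for a linear pass over the disk. The names `find_discords`, `scan`, and `r` are illustrative only.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Series = Sequence[float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_discords(
    scan: Callable[[], Iterable[Series]], r: float
) -> List[Tuple[int, float]]:
    """Return (index, nearest-neighbor distance) for every series whose
    nearest neighbor is at least r away, largest distance first.

    `scan` is a hypothetical stand-in for one sequential pass over the disk.
    """
    # Scan 1: keep a small in-memory candidate set. Any two series closer
    # than r rule each other out, so neither can be a discord.
    candidates: Dict[int, Series] = {}
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if euclidean(s, candidates[j]) < r:
                del candidates[j]
                is_candidate = False
        if is_candidate:
            candidates[i] = s

    # Scan 2: compute the exact nearest-neighbor distance of each surviving
    # candidate and discard those that turn out to have a neighbor within r.
    nn_dist: Dict[int, float] = {j: math.inf for j in candidates}
    for i, s in enumerate(scan()):
        for j in list(nn_dist):
            if i == j:
                continue  # a series is not its own neighbor
            d = euclidean(s, candidates[j])
            if d < r:
                del nn_dist[j]
            elif d < nn_dist[j]:
                nn_dist[j] = d

    return sorted(nn_dist.items(), key=lambda kv: -kv[1])
```

Under these assumptions the result is exact: a series whose nearest neighbor is at least `r` away is added to the candidate set when first read and is never removed in either scan, while the second scan eliminates every false positive. Main memory holds only the candidate set, so the choice of `r` governs the buffer size. Too small a value lets the candidate set grow, and too large a value returns no discords.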
