Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining

Unsupervised time series mining has attracted great interest from both the academic and industrial communities. As two of the most fundamental mining tasks, the discovery of frequent and rare subsequences has been extensively studied in the literature. Specifically, frequent and rare subsequences are defined as those with the smallest and largest 1-nearest-neighbor distances, also known as the motif and the discord, respectively. However, the discord fails to identify a rare subsequence when it occurs more than once in the time series, a limitation widely known as the twin freak problem. This problem is only the tip of the iceberg of the issues caused by 1-nearest-neighbor-distance-based definitions. In this work, we provide, for the first time, a clear theoretical analysis of the motif and discord as 1-nearest-neighbor-based nonparametric density estimates of subsequences. In particular, we focus on the matrix profile, a recently proposed mining framework that unifies motif and discord discovery under the same computing model. We then point out three inherent issues, namely low-quality density estimation, gravity-defiant behavior, and the lack of a reusable model, which degrade the matrix profile in both efficiency and subsequence quality. To overcome these issues, we propose the neighbor profile, which robustly models subsequence density by bagging nearest neighbors for the discovery of frequent and rare subsequences. Specifically, we draw multiple subsamples and average the density estimates obtained from adjusted nearest-neighbor distances on each subsample, which not only improves estimation robustness but also yields a reusable model for efficient learning. We perform sanity checks of the neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that the neighbor profile correctly models subsequences of different densities and significantly outperforms the matrix profile on a real-world arrhythmia dataset. We also show that the neighbor profile is efficient for massive datasets.

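The core idea described in the abstract, averaging 1-nearest-neighbor density estimates over several random subsamples of subsequences rather than relying on a single 1-nearest-neighbor distance over the full series, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain averaged 1-NN distance used in place of the paper's adjusted-distance density estimate, the subsample sizes, the trivial-match exclusion rule, and the toy data below are all assumptions made for illustration.

```python
import numpy as np

def znorm(x, eps=1e-8):
    # Z-normalize a subsequence, as is standard in matrix-profile-style mining.
    return (x - x.mean()) / (x.std() + eps)

def sliding_subsequences(ts, m):
    # Extract all z-normalized subsequences of length m from a 1-D time series.
    n = len(ts) - m + 1
    return np.stack([znorm(ts[i:i + m]) for i in range(n)])

def neighbor_profile_sketch(ts, m, n_subsamples=10, subsample_size=64,
                            exclusion=None, rng=None):
    """Average 1-NN distances to random subsamples of subsequences.

    Small averaged values indicate dense (frequent) subsequences; large
    values indicate sparse (rare) subsequences. Illustrative bagged
    nearest-neighbor estimator, not the paper's exact formulation.
    """
    rng = np.random.default_rng(rng)
    subs = sliding_subsequences(ts, m)          # shape: (n_subs, m)
    n_subs = len(subs)
    exclusion = exclusion if exclusion is not None else m // 2
    profile = np.zeros(n_subs)

    for _ in range(n_subsamples):
        idx = rng.choice(n_subs, size=min(subsample_size, n_subs), replace=False)
        sample = subs[idx]                      # this subsample acts as a small reusable "model"
        # Euclidean distances from every subsequence to the sampled subsequences.
        d = np.linalg.norm(subs[:, None, :] - sample[None, :, :], axis=2)
        # Ignore trivial matches, i.e. sampled subsequences that overlap the query.
        trivial = np.abs(np.arange(n_subs)[:, None] - idx[None, :]) < exclusion
        d[trivial] = np.inf
        profile += d.min(axis=1)                # 1-NN distance within this subsample

    return profile / n_subsamples

# Toy usage: a repeated sine pattern with one injected flat (rare) segment.
t = np.linspace(0, 20 * np.pi, 2000)
ts = np.sin(t)
ts[1000:1050] = 0.0                             # rare, discord-like region
np_profile = neighbor_profile_sketch(ts, m=50, rng=0)
print("a frequent subsequence starts near", int(np.argmin(np_profile)))
print("the rarest subsequence starts near", int(np.argmax(np_profile)))
```

Averaging over subsamples is what gives the bagging effect: a single unlucky 1-NN distance (e.g., a twin of a rare pattern landing in one subsample) is smoothed out across bags, and each stored subsample can be reused to score new subsequences without recomputing an all-pairs join.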