Exploiting a novel algorithm and GPUs to break the ten quadrillion pairwise comparisons barrier for time series motifs and joins

Time series motifs are approximately repeated subsequences found within a longer time series. They have been in the literature since 2002, but only recently have they begun to receive significant attention in both research and industrial communities. This is perhaps due to the growing realization that they implicitly offer solutions to a host of time series problems, including rule discovery, anomaly detection, density estimation, semantic segmentation, and summarization. Recent work has improved scalability to the point where exact motifs can be computed on datasets with up to a million data points in tenable time. However, some domains, such as seismology and climatology, have an immediate need to address even larger datasets. In this work, we demonstrate that combining a novel algorithm with a high-performance GPU significantly improves the scalability of motif discovery. We demonstrate this by finding the full set of exact motifs on a dataset with one hundred and forty-three million subsequences, which is by far the largest dataset ever mined for time series motifs/joins and requires ten quadrillion pairwise comparisons. Furthermore, we demonstrate that our algorithm can produce actionable insights in seismology and ethology.
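To make the scale concrete, the following is a minimal Python sketch, not the algorithm proposed in this work. It assumes that the comparison count is the number of unordered subsequence pairs, n(n - 1)/2, and that similarity is measured by z-normalized Euclidean distance with overlapping (trivial) matches excluded. The abstract states neither assumption, and the function names `pairwise_comparisons`, `znorm`, and `brute_force_motif` are hypothetical.

```python
import math


def pairwise_comparisons(n_subsequences: int) -> int:
    """Count unordered pairs among n subsequences: n * (n - 1) / 2."""
    return n_subsequences * (n_subsequences - 1) // 2


def znorm(values):
    """Rescale a subsequence to zero mean and unit standard deviation.

    Assumption: z-normalization is used so that offset and amplitude
    differences do not affect the comparison.
    """
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)  # constant subsequence
    return [(v - mean) / std for v in values]


def brute_force_motif(series, m):
    """Return (distance, i, j) for the closest pair of length-m subsequences.

    This is the naive quadratic baseline, shown only to illustrate why
    the number of comparisons grows so quickly with series length.
    """
    n = len(series) - m + 1
    subs = [znorm(series[i:i + m]) for i in range(n)]
    best = (math.inf, -1, -1)
    for i in range(n):
        # Assumption: start j at i + m so that the two subsequences
        # do not overlap (overlapping matches are trivially similar).
        for j in range(i + m, n):
            d = math.dist(subs[i], subs[j])
            if d < best[0]:
                best = (d, i, j)
    return best


if __name__ == "__main__":
    # The dataset size quoted in the abstract.
    print(pairwise_comparisons(143_000_000))

    # A toy series with the shape [0, 1, 3, 1, 0] planted twice.
    toy = [0, 1, 3, 1, 0, 5, 2, 7, 4, 9, 6, 0, 1, 3, 1, 0]
    print(brute_force_motif(toy, 5))
```

Under these assumptions, 143 million subsequences give roughly 1.02 × 10^16 pairs, which matches the "ten quadrillion" figure. The nested loop shows why a naive approach becomes impractical at that size.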
