Parallel time series join using Spark

A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. We measure the correlation between two time series with the Pearson correlation coefficient, which has several beneficial mathematical properties, for example, invariance with respect to scale and offset. Considering the need to analyze big time series data, we focus on scalable and distributed techniques for processing massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large-scale data processing. Our solution builds on (1) a fast method that uses the fast Fourier transform to calculate the correlation between two time series, (2) a lossless partition method that divides the time series into multiple subsequences and enables a parallel and correct computation of the join result, and (3) optimization techniques that avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.
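
To make the idea of a lossless partition concrete, the following is a minimal sketch in plain Python, not the implementation described in the paper. It assumes a fixed subsequence length `m` and a correlation threshold, neither of which is specified in the abstract, and it replaces the FFT-based correlation with a brute-force Pearson computation for readability. The function names (`pearson`, `join`, `partition`, `partitioned_join`) and the `chunk` parameter are hypothetical. The point it illustrates is that chunks overlapping by `m - 1` points keep every length-`m` window intact, so the union of the per-chunk joins should equal the join over the whole series.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat constant windows as uncorrelated
    return cov / sqrt(vx * vy)


def join(a, b, m, threshold, a_offset=0, b_offset=0):
    """Brute-force join: start positions (i, j) of all length-m window
    pairs whose correlation reaches the threshold (an assumed criterion)."""
    return {
        (i + a_offset, j + b_offset)
        for i in range(len(a) - m + 1)
        for j in range(len(b) - m + 1)
        if pearson(a[i:i + m], b[j:j + m]) >= threshold
    }


def partition(series, m, chunk):
    """Split a series into (start, values) chunks. Each chunk owns `chunk`
    window start positions and carries m - 1 extra trailing points, so no
    window is cut at a chunk boundary."""
    return [
        (start, series[start:start + chunk + m - 1])
        for start in range(0, len(series) - m + 1, chunk)
    ]


def partitioned_join(a, b, m, threshold, chunk):
    """Join every pair of chunks independently and merge the results.
    In a distributed setting each chunk pair could become one task."""
    result = set()
    for a_start, a_part in partition(a, m, chunk):
        for b_start, b_part in partition(b, m, chunk):
            result |= join(a_part, b_part, m, threshold, a_start, b_start)
    return result
```

Because every window start position belongs to exactly one chunk, no pair is lost and none is reported twice. The scale and offset invariance mentioned above can be checked with the same helper: `pearson(x, [3 * v + 5 for v in x])` is 1 up to rounding for any non-constant `x`.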
