A fast LSH-based similarity search method for multivariate time series

Abstract: Due to advances in mobile devices and sensors, there has been increasing interest in the analysis of multivariate time series. Identifying similar time series is a core subroutine in many data mining and analysis problems. However, existing solutions mainly focus on univariate time series and fail to scale as the number of dimensions increases. Although dimensionality reduction can reduce the impact of noisy information, the number of remaining dimensions may still be too large. In this paper, an efficient approximation method is proposed based on locality-sensitive hashing (LSH). It is a two-step solution that first retrieves candidate time series and then exploits their hash values to compute distance estimates for pruning. To probabilistically guarantee the accuracy of the results, an extensive error analysis has been conducted to determine appropriate LSH parameters. In addition, the proposed method is applied to PkNN classification and hierarchical clustering workloads. Finally, extensive experiments are conducted on both real multivariate time series and high-dimensional representations generated from univariate datasets, across different query processing and data analysis workloads. Empirical results verify the findings of the error analysis and demonstrate the benefits of the proposed method in terms of query efficiency when dealing with a collection of multivariate time series.
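The two-step idea in the abstract (retrieve candidates by hash collision, then reuse the hash values to prune before computing exact distances) can be illustrated with a minimal sketch. This is not the paper's algorithm or its parameter choices. It assumes Euclidean distance on flattened series, Gaussian (p-stable) hash functions, and a simple collision-fraction score as the distance proxy. The names `PStableLSH`, `two_step_search`, `band_size`, and `min_collision_frac` are hypothetical and chosen only for illustration.

```python
import numpy as np


class PStableLSH:
    """Hypothetical helper: hash functions h(x) = floor((a.x + b) / w)."""

    def __init__(self, dim, num_hashes, bucket_width, seed=0):
        rng = np.random.default_rng(seed)
        # Gaussian projections are 2-stable, so they suit Euclidean distance.
        self.a = rng.standard_normal((num_hashes, dim))
        self.b = rng.uniform(0.0, bucket_width, size=num_hashes)
        self.w = bucket_width

    def hash(self, x):
        # Flatten a (length x variables) series into one vector (an assumption).
        x = np.asarray(x, dtype=float).reshape(-1)
        return np.floor((self.a @ x + self.b) / self.w).astype(np.int64)


def two_step_search(lsh, series, query, band_size, min_collision_frac):
    """Return (index, exact distance) pairs for series that survive both steps."""
    codes = np.stack([lsh.hash(s) for s in series])
    matches = codes == lsh.hash(query)  # shape: (num_series, num_hashes)

    # Step 1: a series is a candidate if it agrees with the query on every
    # hash value of at least one band (a band plays the role of a hash table).
    n, m = matches.shape
    usable = (m // band_size) * band_size
    bands = matches[:, :usable].reshape(n, -1, band_size)
    candidates = np.flatnonzero(bands.all(axis=2).any(axis=1))

    # Step 2: the fraction of matching hash values estimates the collision
    # probability, which decreases with distance; prune low-scoring candidates.
    frac = matches[candidates].mean(axis=1)
    keep = candidates[frac >= min_collision_frac]

    # Exact distances are computed only for the survivors.
    q = np.asarray(query, dtype=float).reshape(-1)
    dists = [
        float(np.linalg.norm(np.asarray(series[i], dtype=float).reshape(-1) - q))
        for i in keep
    ]
    order = np.argsort(dists)
    return [(int(keep[i]), dists[i]) for i in order]
```

In this sketch, a larger `band_size` makes step 1 more selective, while `min_collision_frac` controls how aggressively step 2 prunes. The paper derives its LSH parameters from an error analysis, which this example does not attempt to reproduce.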
