Beyond one billion time series: indexing and mining very large time series collections with iSAX2+

Applications in diverse domains increasingly need techniques that can index and mine very large collections of time series. Examples come from astronomy, biology, the web, and other domains. It is not unusual for these applications to involve hundreds of millions to billions of time series. However, none of the relevant techniques proposed in the literature so far has considered data collections much larger than one million time series. In this paper, we describe iSAX 2.0 and its improvements, iSAX 2.0 Clustered and iSAX2+, three methods designed for indexing and mining truly massive collections of time series. We show that the main bottleneck in mining such massive datasets is the time taken to build the index, and we thus introduce a novel bulk loading mechanism, the first of its kind specifically tailored to a time series index. We show how our methods allow mining on datasets that would otherwise be completely untenable, including the first published experiments to index one billion time series, and experiments in mining massive data from domains as diverse as entomology, DNA and web-scale image collections.
