TRISTAN: Real-time analytics on massive time series using sparse dictionary compression

Large-scale critical infrastructures such as transportation, energy, or water distribution networks are increasingly equipped with smart sensor technologies. Low-latency analytics on the resulting time series would open the door to many opportunities to improve our understanding of complex urban systems. In practice, however, sensor-generated time series are often noisy, non-uniformly sampled, and misaligned, which makes them ill-suited for traditional data processing. In this paper, we introduce TRISTAN (massive TRIckletS Time series ANalysis), a new data management system for efficient storage and real-time processing of fine-grained time series data. TRISTAN relies on a dedicated, compressed sparse representation of the time series using a dictionary. In contrast to previous approaches, TRISTAN can execute most analytics queries directly on the compressed data, and it supports efficient approximate query answering using only the most significant atoms of the dictionary. We present the overall architecture of our system and discuss its performance on several smarter city datasets, showing that TRISTAN can achieve up to 20:1 compression ratios and a 250x speedup compared to a state-of-the-art system.
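The idea of querying a sparse dictionary representation can be illustrated with a minimal sketch. It is not TRISTAN's implementation: the abstract does not say which sparse coding algorithm is used, so generic matching pursuit is assumed here, and every name below (`matching_pursuit`, `segment_sum`, `reconstruct`, the toy atoms) is hypothetical. A segment is stored as a short list of (atom index, coefficient) pairs; a sum over the segment is then computed from the coefficients and precomputed per-atom sums, without decompressing, and an approximate answer uses only the largest coefficients.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, atoms, max_atoms):
    """Greedy sparse coding (assumed algorithm, not confirmed by the source).

    Returns a list of (atom_index, coefficient) pairs.
    """
    residual = list(signal)
    code = []
    for _ in range(max_atoms):
        scores = [dot(residual, a) for a in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coef = scores[best]
        if abs(coef) < 1e-12:  # residual is (numerically) fully explained
            break
        code.append((best, coef))
        residual = [r - coef * x for r, x in zip(residual, atoms[best])]
    return code


def top_atoms(code, top_k=None):
    """Keep the top_k entries with the largest absolute coefficient."""
    ranked = sorted(code, key=lambda p: abs(p[1]), reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def segment_sum(code, atom_sums, top_k=None):
    """Sum of the segment, computed on the compressed form only."""
    return sum(c * atom_sums[i] for i, c in top_atoms(code, top_k))


def reconstruct(code, atoms, top_k=None):
    """Rebuild the (possibly approximate) segment from its sparse code."""
    out = [0.0] * len(atoms[0])
    for i, c in top_atoms(code, top_k):
        out = [o + c * x for o, x in zip(out, atoms[i])]
    return out


# Toy dictionary of three unit-norm atoms of length 4 (illustrative only).
atoms = [normalize(a) for a in ([1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 1, -1])]
atom_sums = [sum(a) for a in atoms]  # precomputed once per dictionary

code = matching_pursuit([4.0, 4.0, 2.0, 2.0], atoms, max_atoms=3)
exact_total = segment_sum(code, atom_sums)          # uses all stored atoms
coarse = reconstruct(code, atoms, top_k=1)          # most significant atom only
```

In this toy case the segment is encoded with two coefficients instead of four samples, and keeping only the most significant atom yields a flat approximation of the segment at its mean level. How TRISTAN learns its dictionary and which queries it supports on the compressed form are described in the paper itself, not in this sketch.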
