CASINO TIMES: Compression and Similarity Indexing for Time Series

Detecting similarities within the time series provided by the Google n-gram data can help researchers explore and understand relationships between concrete words and abstract concepts. We construct a way of expressing this similarity and explain why we consider it a sane approach. Another important aspect of handling this kind of data at large scale is the existence of an index structure for this task. We show how the prior art performs and why this time series data set differs from many other data sets. We then explore another way of exploiting similar information between time series to construct an index and compress the data at the same time. Afterwards, we show why our approach might have some fundamental issues and discuss possible workarounds. After presenting our implementation of tools in this domain, we run a wide set of evaluations covering the semantic meaning of our similarity metric, the effect of the compression on query processing, the filter design that uses our system as an index, and finally performance measurements. While the selected baseline provides sane results, it turns out that our approach suffers from the fact that greedy decisions are only optimal locally. From a global perspective they are often the wrong decisions, and therefore we are unable to reach good compression rates while keeping the distortion low. We end this work with some conclusions about the not necessarily great results and give an outlook on future research and on other approaches that, in our opinion, might be worth trying.
