Efficiently supporting ad hoc queries in large datasets of time sequences

Ad hoc querying is difficult on very large datasets, since it is usually not possible to have the entire dataset on disk. While compression can be used to decrease the size of the dataset, compressed data is notoriously difficult to index or access. In this paper we consider a very large dataset comprising multiple distinct time sequences. Each point in a sequence is a numerical value. We show how to compress such a dataset into a format that supports ad hoc querying, provided that a small error can be tolerated when the data is decompressed. Experiments on large, real-world datasets (AT&T customer calling patterns) show that the proposed method achieves an average error of less than 5% in any data value after compressing to a mere 2.5% of the original space (i.e., a 40:1 compression ratio), and these figures are not very sensitive to dataset size. Experiments on aggregate queries achieved a 0.5% reconstruction error with a space requirement under 2%.
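To make the idea of a compressed format that still answers ad hoc queries concrete, the following is a minimal sketch under stated assumptions. It is not the method proposed in the paper, which the abstract does not describe: it swaps in a much simpler lossy scheme (replacing each run of 40 consecutive values by its mean) chosen only because it yields the same 40:1 ratio. The names `compress`, `point_query`, `range_sum`, and `BLOCK`, and the synthetic sequence, are hypothetical.

```python
# Illustrative stand-in only: block averaging, NOT the paper's compression method.
BLOCK = 40  # assumed block length; 1 stored value per 40 originals = 2.5% of the space


def compress(seq, block):
    """Replace each run of `block` consecutive values by its mean."""
    out = []
    for start in range(0, len(seq), block):
        chunk = seq[start:start + block]
        out.append(sum(chunk) / len(chunk))
    return out


def point_query(comp, block, j):
    """Approximate the value at time j by reading a single stored number."""
    return comp[j // block]


def range_sum(comp, block, lo, hi):
    """Approximate the sum over times lo..hi-1 directly from the compressed form."""
    return sum(comp[j // block] for j in range(lo, hi))


# Synthetic, slowly varying sequence (an assumption; not the AT&T data).
seq = [100.0 + 0.01 * t for t in range(4000)]
comp = compress(seq, BLOCK)

space_fraction = len(comp) / len(seq)
approx = point_query(comp, BLOCK, 1234)
relative_error = abs(approx - seq[1234]) / seq[1234]
aggregate = range_sum(comp, BLOCK, 0, len(seq))
```

In this sketch a point query touches one stored value and an aggregate query never reconstructs the full sequence, which is the access pattern the abstract asks of a compressed format. Aggregates aligned to block boundaries are reproduced almost exactly because averaging preserves block sums, which gives one intuition for why aggregate queries can tolerate less space than individual values.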
