Exploiting locality for query processing and compression in scientific databases

Improvements in the efficiency of scientific simulations have led to a need for large databases. The data captured in such simulations are large in scale and pose challenges in storage, transfer, and query processing. However, because the data are collected every fraction of a second, consecutive snapshots contain redundant information. This temporal and spatial locality gives us an opportunity to store the data and transfer them over networks efficiently. Data locality also helps in efficiently processing the complex analytical queries that are popular in scientific databases. Many scientific data analysis queries involve more than one object or body of interest, and processing such queries has super-linear computational complexity. In this paper, we propose preliminary solutions to some of these problems along with initial results. In particular, we exploit the spatial and temporal proximity of the data to achieve high levels of compression for efficient storage and to speed up analytical query processing.
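
To illustrate why temporal locality can help compression, the following is a minimal sketch and not the method proposed in the paper. It assumes particle coordinates have been quantized to integers, stores each snapshot as its difference from the previous one (plain delta encoding), and compares the zlib-compressed sizes. The helper names (`make_frames`, `delta_encode`, `delta_decode`, `packed_size`) and the synthetic random-walk data are hypothetical and chosen only for illustration.

```python
import random
import struct
import zlib


def make_frames(n_particles, n_frames, seed=0):
    """Synthetic data (an assumption): each coordinate moves by a small step per frame."""
    rng = random.Random(seed)
    frame = [rng.randrange(1_000_000) for _ in range(n_particles)]
    frames = [frame]
    for _ in range(n_frames - 1):
        frame = [x + rng.randint(-3, 3) for x in frame]
        frames.append(frame)
    return frames


def delta_encode(frames):
    """Keep the first frame unchanged; store every later frame as a difference from its predecessor."""
    deltas = [list(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences frame by frame."""
    frames = [list(deltas[0])]
    for d in deltas[1:]:
        frames.append([p + x for p, x in zip(frames[-1], d)])
    return frames


def packed_size(rows):
    """Size in bytes after packing all values as 32-bit signed integers and applying zlib."""
    flat = [v for row in rows for v in row]
    return len(zlib.compress(struct.pack(f"<{len(flat)}i", *flat)))


if __name__ == "__main__":
    frames = make_frames(n_particles=100, n_frames=50)
    deltas = delta_encode(frames)
    print("compressed raw frames:  ", packed_size(frames), "bytes")
    print("compressed delta frames:", packed_size(deltas), "bytes")
```

Because consecutive snapshots differ only slightly, the differences fall in a narrow range of values, which a general-purpose compressor can encode far more compactly than the raw coordinates. Real simulation data would need a quantization step with a chosen error bound before a scheme like this applies.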
