Storage and Recreation Trade-Off for Multi-version Data Management

With the rapid development of data acquisition technology, massive volumes of observation data have accumulated across scientific disciplines. Because successive observations differ only slightly, multi-version data management technology is critical for compressing the data while keeping both storage and recreation costs low. However, existing work in this field optimizes only the total storage and total recreation costs and ignores the recreation cost of individual versions. In this paper, we therefore investigate the trade-off among three metrics: total storage cost, total recreation cost, and the maximum recreation cost of any single version. We formulate two problems: (1) find a storage plan that lowers the total and individual recreation costs when the total storage is limited; (2) find a storage plan that minimizes the total storage when the total and individual recreation costs are bounded. To solve these problems, we model all versions as a directed graph and devise two efficient algorithms based on spanning trees. A series of experiments indicates that the proposed algorithms are effective and efficient on both problems.
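To make the three metrics concrete, the following minimal Python sketch evaluates a storage plan expressed as a spanning tree over a version graph. It is an illustration under stated assumptions, not the algorithms proposed in the paper: the dummy `ROOT` node (an edge from it stands for storing a version in full), the `evaluate_plan` helper, the version names, and all cost values are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical dummy node: an edge ROOT -> v means "store version v in full".
ROOT = "root"

Edge = Tuple[str, str]

# (source, target) -> (storage cost, recreation cost).
# All numbers are made up for illustration.
COSTS: Dict[Edge, Tuple[float, float]] = {
    (ROOT, "v1"): (100, 100),
    (ROOT, "v2"): (105, 105),
    (ROOT, "v3"): (110, 110),
    ("v1", "v2"): (10, 40),
    ("v2", "v3"): (12, 45),
    ("v1", "v3"): (20, 60),
}


def evaluate_plan(parent: Dict[str, str],
                  costs: Dict[Edge, Tuple[float, float]]):
    """Return (total storage, total recreation, max recreation) of a plan.

    `parent` maps every version to the node it is stored against, so the
    plan is assumed to be a spanning tree rooted at ROOT (no cycles).
    """
    recreation: Dict[str, float] = {}

    def recreate(v: str) -> float:
        # Recreation cost of v = sum of recreation costs on the path ROOT -> v.
        if v == ROOT:
            return 0.0
        if v not in recreation:
            p = parent[v]
            recreation[v] = recreate(p) + costs[(p, v)][1]
        return recreation[v]

    total_storage = sum(costs[(p, v)][0] for v, p in parent.items())
    per_version = [recreate(v) for v in parent]
    return total_storage, sum(per_version), max(per_version)


# Two extreme plans: store every version in full, or store one long delta chain.
materialize_all = {"v1": ROOT, "v2": ROOT, "v3": ROOT}
delta_chain = {"v1": ROOT, "v2": "v1", "v3": "v2"}

print(evaluate_plan(materialize_all, COSTS))
print(evaluate_plan(delta_chain, COSTS))
```

With these made-up costs, the delta chain needs far less storage than full materialization but has a higher total recreation cost and a higher maximum recreation cost, which is the tension the two problem formulations address.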
