Self-tuning management of update-intensive multidimensional data in clusters of workstations

Contemporary applications continuously modify large volumes of multidimensional data that must be accessed efficiently and, more importantly, updated in a timely manner. Single-server storage approaches cannot manage such volumes of data, and the high frequency of data modification renders classical indexing methods inefficient. To address these two problems, we introduce a distributed storage manager for multidimensional data based on a cluster of workstations. The manager uses a set of mechanisms that, through selective on-line data reorganization, collectively maintain a balanced load across the cluster. It combines a fast self-tuning mechanism, based on a new data structure called stat-index, with a query aggregation and clustering algorithm, and thereby attains short query response times even in the presence of massive modifications and highly skewed access patterns. Furthermore, we provide a data migration cost model that is used to determine the best data redistribution strategy. Through extensive experimentation with our prototype, we establish that our storage manager can sustain significant update rates with minimal overhead.
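To make the idea of cost-driven redistribution concrete, the following is a minimal sketch of how a migration decision could weigh load-balance benefit against transfer cost. It is not the paper's stat-index or its actual cost model, neither of which is specified in this abstract. The names `Region`, `migration_cost`, and `pick_migration`, the bandwidth and overhead constants, and the benefit-per-cost score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A unit of data that can be moved between nodes (hypothetical)."""
    name: str
    load: float     # observed accesses per interval (assumed statistic)
    size_mb: float  # data volume that would have to be shipped


def migration_cost(region, bandwidth_mb_s=100.0, fixed_overhead_s=0.5):
    """Assumed cost model: fixed setup time plus transfer time in seconds."""
    return fixed_overhead_s + region.size_mb / bandwidth_mb_s


def pick_migration(nodes):
    """Pick one region to move from the busiest to the idlest node.

    nodes maps a node name to its list of regions. Returns a tuple
    (region, source, destination), or None if no move narrows the gap.
    """
    loads = {n: sum(r.load for r in rs) for n, rs in nodes.items()}
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    gap = loads[src] - loads[dst]

    best, best_score = None, 0.0
    for region in nodes[src]:
        # Moving the region shifts its load from src to dst.
        new_gap = abs(gap - 2 * region.load)
        reduction = gap - new_gap
        if reduction <= 0:
            continue  # the move would not improve balance
        score = reduction / migration_cost(region)
        if score > best_score:
            best, best_score = region, score
    return (best, src, dst) if best is not None else None


if __name__ == "__main__":
    cluster = {
        "a": [Region("r1", 80, 400), Region("r2", 30, 50), Region("r3", 10, 10)],
        "b": [Region("r4", 20, 100)],
    }
    choice = pick_migration(cluster)
    if choice is not None:
        region, src, dst = choice
        print(f"move {region.name} from {src} to {dst}")
```

In this toy cluster the hottest region is not the one selected, because its transfer cost outweighs its balancing benefit; a cheaper region that still narrows the load gap scores higher. A real system would also have to account for the cost of updating indexes and of queries issued during the move.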
