Scalable community-driven data sharing in e-science grids

E-science projects across many disciplines face a fundamental challenge: thousands of users want to obtain new scientific results through application-specific, dynamic correlation of data from globally distributed sources. Given the enormous and exponentially growing data volumes involved, centralized data management reaches its limits. Because scientific data are often highly skewed and exploration tasks exhibit a high degree of spatial locality, we propose the locality-aware allocation of data objects onto a distributed network of interoperating databases. HiSbase is an approach to data management in federated scientific Data Grids that addresses the scalability issue by combining established database techniques, namely spatial data structures (quadtrees), histograms, and parallel databases, with the scalable resource sharing and load balancing capabilities of decentralized Peer-to-Peer (P2P) networks. The resulting combination forms a complementary e-science infrastructure that enables load balancing and increases query throughput.
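
To make the idea of locality-aware allocation more concrete, the following is a minimal illustrative sketch and not HiSbase's actual algorithm. It assumes two-dimensional point data, a quadtree that splits any region holding more than a hypothetical `capacity` of points, and a naive assignment of consecutive leaf regions to peers. All names (`Region`, `split`, `assign_to_peers`) and all parameter values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    """Axis-aligned rectangle [x0, x1) x [y0, y1) with the points it holds."""
    x0: float
    y0: float
    x1: float
    y1: float
    points: List[Point]


def split(region: Region, capacity: int, max_depth: int = 16) -> List[Region]:
    """Split a region into quadrants until each leaf holds at most
    `capacity` points (or `max_depth` is exhausted). Dense areas end up
    covered by many small leaves, sparse areas by a few large ones."""
    if len(region.points) <= capacity or max_depth == 0:
        return [region]
    mx = (region.x0 + region.x1) / 2
    my = (region.y0 + region.y1) / 2
    quads = [
        Region(region.x0, region.y0, mx, my, []),  # lower left
        Region(mx, region.y0, region.x1, my, []),  # lower right
        Region(region.x0, my, mx, region.y1, []),  # upper left
        Region(mx, my, region.x1, region.y1, []),  # upper right
    ]
    for x, y in region.points:
        idx = (1 if x >= mx else 0) + (2 if y >= my else 0)
        quads[idx].points.append((x, y))
    leaves: List[Region] = []
    for quad in quads:
        leaves.extend(split(quad, capacity, max_depth - 1))
    return leaves


def assign_to_peers(leaves: List[Region], n_peers: int) -> Dict[int, List[Region]]:
    """Give each peer a contiguous run of leaves in traversal order, so
    that regions that are close in the traversal tend to share a peer.
    This balances the number of regions per peer, not the exact load."""
    placement: Dict[int, List[Region]] = {p: [] for p in range(n_peers)}
    for i, leaf in enumerate(leaves):
        placement[i * n_peers // len(leaves)].append(leaf)
    return placement


# Skewed sample data: a dense cluster near the origin plus a few outliers.
cluster = [(i * 0.001, j * 0.001) for i in range(10) for j in range(10)]
outliers = [(0.9, 0.9), (0.6, 0.2), (0.3, 0.8), (0.7, 0.6)]
root = Region(0.0, 0.0, 1.0, 1.0, cluster + outliers)

leaves = split(root, capacity=10)
placement = assign_to_peers(leaves, n_peers=4)
```

In this sketch the dense cluster is covered by many small leaves while the sparse remainder of the unit square stays in a few large ones, which is the intuition behind adapting the partitioning to skewed data. A real system would derive the split decisions from histograms over training data and map regions onto a P2P overlay rather than a Python dictionary.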
