Supporting load balancing and efficient reorganization during system scaling

As distributed storage systems scale up and down, data must be reorganized repeatedly to keep the load balanced. To support load balancing and efficient reorganization during system scaling, we propose a new hashing method, prime-based hashing (PBH), for data allocation in large distributed systems. PBH distributes objects among storage units based on the residues of hash-transformed key values modulo prime numbers. It provides nearly perfect load balancing: objects are distributed evenly, and the distribution is rebalanced to remain even as the system scales. At the same time, it keeps reorganization cost-effective by minimizing data migration during scaling. Locating an object in PBH is fast: it takes a low-complexity computation that requires knowing only the total number of storage units. We also propose a local data clustering method that is coupled with PBH to make reorganization more efficient. Objects are clustered in order of migration, so that only the data that needs to be migrated is scanned. In addition, we show that storing a small amount of precomputed information makes ordering objects for clustering very efficient. We demonstrate the effectiveness of our algorithms through analysis and experiments.
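The abstract does not spell out the PBH algorithm, so the following Python sketch is not PBH. It only illustrates the two quantities the abstract refers to, load balance and the fraction of objects migrated when the system scales, using plain modulo placement as a baseline. All function names, the choice of SHA-256 as the hash transform, and the key format are assumptions made for illustration.

```python
import hashlib


def hashed_key(key: str) -> int:
    """Hash-transform a key into a non-negative integer (SHA-256 is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place(key: str, num_units: int) -> int:
    """Baseline placement: residue of the hashed key modulo the number of units.

    Like the lookup described in the abstract, it needs only the unit count,
    but it is NOT the PBH scheme itself.
    """
    return hashed_key(key) % num_units


def load_per_unit(keys, num_units):
    """Count how many objects land on each storage unit."""
    counts = [0] * num_units
    for key in keys:
        counts[place(key, num_units)] += 1
    return counts


def migrated_fraction(keys, old_units, new_units):
    """Fraction of objects whose unit changes when the unit count changes."""
    moved = sum(1 for key in keys if place(key, old_units) != place(key, new_units))
    return moved / len(keys)


def ideal_fraction_on_growth(old_units, new_units):
    """Lower bound on migration when growing while keeping an even load.

    Each of the (new_units - old_units) added units must end up holding
    1/new_units of the data, and all of that data has to move.
    """
    return (new_units - old_units) / new_units
```

Under these assumptions, growing from 7 to 8 units with the baseline moves most objects, whereas an even distribution requires moving only one eighth of them. That gap between the two figures is the migration cost that a scheme designed to minimize data migration aims to close.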
