Optimal File-Bundle Caching Algorithms for Data-Grids

The file-bundle caching problem arises frequently in scientific applications where jobs process several files concurrently. Consider a host system in a data-grid that maintains a disk cache for servicing jobs, where each job requests a set of files (a file-bundle) and is serviced only if all of its requested files are present in the disk cache. Files must therefore be admitted into the cache and replaced as file-bundles rather than individually. We show that traditional caching algorithms based on file popularity measures do not perform well, since they may hold in the cache combinations of files that do not match the bundles that jobs actually request. We present and analyze a new caching algorithm for maximizing the throughput of jobs and minimizing data replacement costs at such data-grid hosts. We tested the new algorithm using a disk cache simulation model under a wide range of conditions, including different file request distributions, cache sizes, and file size distributions. The results show significant improvement over traditional caching algorithms.
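To make the contrast with popularity-based caching concrete, the following is a minimal sketch and not the algorithm analyzed in the paper, which the abstract does not specify. It assumes unit-size files, non-empty jobs, and a static batch of jobs. The function names (`popularity_cache`, `bundle_greedy_cache`, `jobs_served`) and the greedy ranking by requests per file are hypothetical choices made only for illustration.

```python
from collections import Counter


def popularity_cache(jobs, capacity):
    """Keep the `capacity` most requested files (assumption: unit-size files)."""
    counts = Counter(f for job in jobs for f in job)
    # Break ties by file name so the result is deterministic.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    return set(ranked[:capacity])


def bundle_greedy_cache(jobs, capacity):
    """Illustrative greedy heuristic: admit whole bundles, most requests per file first."""
    bundles = Counter(frozenset(job) for job in jobs)
    ranked = sorted(bundles, key=lambda b: (-bundles[b] / len(b), sorted(b)))
    cache = set()
    for bundle in ranked:
        # Admit a bundle only if the whole bundle still fits in the cache.
        if len(cache | bundle) <= capacity:
            cache |= bundle
    return cache


def jobs_served(jobs, cache):
    """A job is serviced only if all of its files are in the cache."""
    return sum(1 for job in jobs if set(job) <= cache)


# Hypothetical workload: files a and b are individually popular,
# but no job ever requests them together.
jobs = [
    {"a", "c"}, {"a", "d"}, {"a", "e"},
    {"b", "c"}, {"b", "d"}, {"b", "e"},
    {"g", "h"}, {"g", "h"},
]

by_popularity = popularity_cache(jobs, capacity=2)
by_bundle = bundle_greedy_cache(jobs, capacity=2)
print(sorted(by_popularity), jobs_served(jobs, by_popularity))
print(sorted(by_bundle), jobs_served(jobs, by_bundle))
```

In this toy workload the popularity policy caches the two most requested files, which never occur in the same bundle, so no job can be serviced. The bundle-oriented choice caches one complete bundle and services the two jobs that request it.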
