Filecules in High-Energy Physics: Characteristics and Impact on Resource Management

Grid computing has matured to the point where many collaborations run their deployments in production mode. These mature deployments offer an opportunity to revisit, and perhaps update, traditional beliefs about workload models, which in turn calls for re-evaluating traditional resource management techniques. This paper analyzes usage patterns in a typical grid community: a large-scale, data-intensive scientific collaboration in high-energy physics. We focus mainly on data usage, since data is the major resource for this class of applications. Our observations led us to propose a new abstraction for resource management in scientific data analysis applications: we define a filecule as a group of files that are always used together. We show that filecules exist and present their characteristics. The existence of filecules suggests a new granularity for data management which, if incorporated in design, can significantly outperform traditional solutions for data caching, replication, and placement based on single-file granularity. We reason about the impact of filecules on resource management and show compelling evidence for using this abstraction when designing data management services.
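
To make the filecule definition concrete, the following is a minimal sketch of how such groups might be extracted from a job trace. It assumes that "always used together" means two files are requested by exactly the same set of jobs. The trace format, the function name `find_filecules`, and the sample job and file names are hypothetical illustrations, not taken from the paper.

```python
from collections import defaultdict


def find_filecules(job_inputs):
    """Group files that are requested by exactly the same set of jobs.

    job_inputs maps a job id to the set of file names that job reads
    (an assumed trace format, chosen for illustration).
    """
    # Invert the trace: for each file, collect the jobs that used it.
    jobs_per_file = defaultdict(set)
    for job_id, files in job_inputs.items():
        for name in files:
            jobs_per_file[name].add(job_id)

    # Files with identical job sets are always used together.
    groups = defaultdict(set)
    for name, jobs in jobs_per_file.items():
        groups[frozenset(jobs)].add(name)

    # Return the groups in a deterministic order.
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


# Hypothetical three-job trace.
trace = {
    "job1": {"a", "b", "c"},
    "job2": {"a", "b"},
    "job3": {"c", "d"},
}
filecules = find_filecules(trace)
for group in filecules:
    print(sorted(group))
```

In this toy trace, files `a` and `b` appear in the same jobs and so form one filecule, while `c` and `d` each form a single-file filecule. Under this reading every file belongs to exactly one group, which is what would let a cache or replication service treat the group, rather than the file, as its unit.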
