Co-Scheduling of Computation and Data on Computer Clusters

Scientific investigations have to deal with rapidly growing amounts of data from simulations and experiments. During data analysis, scientists typically want to extract subsets of the data and perform computations on them. To speed up the analysis, computations are performed on distributed systems such as computer clusters or Grid systems. Building systems that execute computation and data movement in a coordinated fashion is a well-known, difficult problem. In this paper, we describe an architecture for executing co-scheduled tasks of computation and data movement on a computer cluster. The architecture takes advantage of two technologies currently used in distributed Grid systems. The first is Condor, which manages the scheduling and execution of distributed computation; the second is Storage Resource Managers (SRMs), which manage the space usage and content of storage systems. Co-scheduling is achieved by including the file-availability information that SRMs provide for each node in the information that Condor advertises for matchmaking. The system is capable of dynamic load balancing by replicating popular files on idle nodes. To confirm the feasibility of our approach, we built a prototype system on a computer cluster and performed several experiments based on real work logs. We observed that, without replication, compute nodes are underutilized and job wait times in the scheduler's queue are longer. This architecture can be used in wide-area Grid systems because its basic components are already used in the Grid.

Visiting LBNL from the Computer Sciences Department, University of Wisconsin
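The matchmaking and replication ideas described in the abstract can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the paper's implementation and not the Condor or SRM API: the names `NodeAd`, `Job`, `match`, and `replicate_if_popular`, as well as the popularity `threshold`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeAd:
    """Hypothetical stand-in for the information a node advertises."""
    name: str
    idle: bool
    # Files reported as locally available (in the paper, by the node's SRM).
    cached_files: set = field(default_factory=set)


@dataclass
class Job:
    job_id: int
    input_file: str


def match(job, nodes):
    """Return an idle node that already holds the job's input file, or None."""
    for node in nodes:
        if node.idle and job.input_file in node.cached_files:
            return node
    return None


def replicate_if_popular(file_name, queue, nodes, threshold=2):
    """Copy a file to one idle node that lacks it when at least
    `threshold` queued jobs need it (threshold is an assumed policy)."""
    demand = sum(1 for job in queue if job.input_file == file_name)
    if demand < threshold:
        return None
    for node in nodes:
        if node.idle and file_name not in node.cached_files:
            node.cached_files.add(file_name)  # models the data movement
            return node
    return None


# Example: the only node holding a.dat is busy, so jobs cannot match
# until the file is replicated to the idle node.
nodes = [NodeAd("n1", idle=False, cached_files={"a.dat"}),
         NodeAd("n2", idle=True)]
queue = [Job(1, "a.dat"), Job(2, "a.dat")]
```

In this sketch, `match(queue[0], nodes)` finds no node until `replicate_if_popular("a.dat", queue, nodes)` places a copy on the idle node, which mirrors the abstract's observation that nodes stay underutilized without replication.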
