Creating personal adaptive clusters for managing scientific jobs in a distributed computing environment

We describe a system for creating personal clusters in user-space to support the submission and management of thousands of compute-intensive serial jobs to the network-connected compute resources on the NSF TeraGrid. The system implements a robust infrastructure that submits and manages job proxies across a distributed computing environment. These job proxies contribute resources to personal clusters created dynamically for a user on-demand. The system adapts to the prevailing job load conditions at the distributed sites by migrating job proxies to sites expected to provide resources more quickly. The version of the system described in this paper allows users to build large personal Condor and Sun Grid Engine clusters on the TeraGrid. Users can then submit, monitor and control their scientific jobs with a single uniform interface, using the feature-rich functionality found in these job management environments. Up to 100,000 user jobs have been submitted through the system to date, enabling approximately 900 teraflops of scientific computation

[1]  Adam Arbree,et al.  Mapping Abstract Complex Workflows onto Grid Environments , 2003, Journal of Grid Computing.

[2]  Ian T. Foster,et al.  Condor-G: A Computation Management Agent for Multi-Institutional Grids , 2004, Cluster Computing.

[3]  Francine Berman,et al.  The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[4]  Ian Foster,et al.  The Globus toolkit , 1998 .

[5]  Miron Livny,et al.  Condor-a hunter of idle workstations , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[6]  Ian T. Foster,et al.  Globus: a Metacomputing Infrastructure Toolkit , 1997, Int. J. High Perform. Comput. Appl..

[7]  David Abramson,et al.  Nimrod/G: an architecture for a resource management and scheduling system in a global computational grid , 2000, Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region.

[8]  Tommy Minyard,et al.  Orchestrating and coordinating scientific/engineering workflows using GridShell , 2004, Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004..