GridX1: A Canadian computational grid

This paper describes the design and application of GridX1, a computational grid project that uses shared resources at several Canadian research institutions. The GridX1 infrastructure is built from off-the-shelf Globus Toolkit 2 middleware, a MyProxy credential server, and a Condor-G-based resource broker that manages the distributed computing environment. The broker's job-scheduling and management functionality is exposed as a Globus GRAM job service. Resource brokering relies on the Condor matchmaking mechanism, in which job and resource attributes are expressed as ClassAds; the Requirements and Rank attributes define, respectively, the constraints a matched entity must satisfy and the preferences used to order acceptable matches. Several resource-ranking strategies are presented, including an Estimated-Waiting-Time (EWT) algorithm, a throttled load-balancing strategy, and a novel external ranking strategy based on data location. A distinctive feature of GridX1 is a mechanism that transparently presents its resources as a single compute element to the LHC Computing Grid (LCG), based at the CERN Laboratory in Geneva. This interface was used during ATLAS Data Challenge 2 to federate the Canadian resources into the LCG without the overhead of maintaining separate LCG sites. In addition, the BaBar particle-physics simulation was adapted to run on GridX1, simplifying production management. Combining the throttled EWT and load-balancing strategies with external data-location ranking proved highly effective at improving efficiency and reducing the job failure rate.
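To make the matchmaking step concrete, the following Python sketch mimics how a job ClassAd's Requirements expression filters candidate resources and its Rank expression orders the survivors (higher Rank is preferred, as in Condor). The attribute names (Arch, FreeCPUs, EstimatedWaitingTime) and cluster names are illustrative assumptions rather than attributes from the GridX1 deployment, and real ClassAds are evaluated by the Condor matchmaker, not by Python.

```python
# Minimal sketch of ClassAd-style matchmaking: Requirements filters candidate
# resources, Rank orders the ones that remain (higher is better).
# Attribute and cluster names are illustrative assumptions.

job_ad = {
    # Constraint a matched resource must satisfy.
    "Requirements": lambda res: res["Arch"] == "x86_64" and res["FreeCPUs"] > 0,
    # Preference: shorter estimated waiting time ranks higher.
    "Rank": lambda res: -res["EstimatedWaitingTime"],
}

resource_ads = [
    {"Name": "cluster-a", "Arch": "x86_64", "FreeCPUs": 12, "EstimatedWaitingTime": 300},
    {"Name": "cluster-b", "Arch": "x86_64", "FreeCPUs": 4,  "EstimatedWaitingTime": 60},
    {"Name": "cluster-c", "Arch": "i686",   "FreeCPUs": 40, "EstimatedWaitingTime": 10},
]

def match(job, resources):
    """Return the resources satisfying Requirements, best Rank first."""
    candidates = [r for r in resources if job["Requirements"](r)]
    return sorted(candidates, key=job["Rank"], reverse=True)

for res in match(job_ad, resource_ads):
    print(res["Name"])   # cluster-b (shorter queue), then cluster-a
```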
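The ranking strategies can be sketched in the same spirit. The function below combines an EWT-based score, a small bounded random term standing in for the throttled load balancing (one possible way to keep a burst of identical jobs from all landing on the single top-ranked site, not necessarily the paper's mechanism), and a large bonus for sites that already hold the job's input data, standing in for the external data-location ranking. Field names, site names, and weights are assumptions for illustration only.

```python
import random

def rank_resource(res, job, jitter=30.0, data_bonus=1000.0):
    """Return a preference score for a resource; higher means more preferred."""
    # Estimated-Waiting-Time strategy: shorter expected wait ranks higher.
    score = -res["ewt"]

    # Throttled load balancing, modelled here as a bounded random jitter so
    # that similar jobs submitted together spread across comparable sites.
    score += random.uniform(0.0, jitter)

    # External ranking by data location: strongly prefer sites that already
    # hold the job's input data, avoiding wide-area transfers.
    if res["name"] in job.get("data_sites", set()):
        score += data_bonus

    return score

resources = [
    {"name": "site-a", "ewt": 600.0},
    {"name": "site-b", "ewt": 450.0},
    {"name": "site-c", "ewt": 900.0},
]
job = {"data_sites": {"site-c"}}

ranked = sorted(resources, key=lambda r: rank_resource(r, job), reverse=True)
print([r["name"] for r in ranked])   # site-c first, thanks to the data bonus
```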
