A REST model for high throughput scheduling in computational grids

Current grid computing architectures have been based on cluster management and batch queuing systems, extended to a distributed, federated domain. These have shown shortcomings in terms of scalability, stability, and modularity. To address these problems, this dissertation applies architectural styles from the Internet and Web to the domain of generic computational grids. Using the REST style, a flexible model for grid resource interaction is developed which removes the need for any centralised services or specific protocols, thereby allowing a range of implementations and layering of further functionality. The context for resource interaction is a generalisation and formalisation of the Condor ClassAd match-making mechanism. This set-theoretic model is described in depth, including the advantages and features which it realises.

This RESTful style is also motivated by operational experience with existing grid infrastructures, and the design, operation, and performance of a proto-RESTful grid middleware package named DIRAC. This package was designed to provide for the LHCb particle physics experiment's "off-line" computational infrastructure, and was first exercised during a 6-month data challenge which utilised over 670 years of CPU time and produced 98 TB of data through 300,000 tasks executed at computing centres around the world. The design of DIRAC and performance measures from the data challenge are reported.

The main contribution of this work is the development of a REST model for grid resource interaction. In particular, it allows resource templating for scheduling queues which provide a novel distributed and scalable approach to resource scheduling on the grid.

I dedicate this work to Emily and Maggie, who have made it possible, and made it purposeful.

9 What has been will be again, what has been done will be done again; there is nothing new under the sun.
10 Is there anything of which one can say, Look! This is something new? It was here already, long ago; it was here before our time.
11 There is no remembrance of men of old, and even those who are yet to come will not be remembered by those who follow.

[67]  David Cooper,et al.  Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile , 2008, RFC.