Scheduling task parallel applications for rapid turnaround on desktop grids

Since the early 1990s, the largest distributed computing systems in the world have been desktop grids, which use the idle cycles of mainly desktop PCs to support large-scale computation. Despite the enormous computing power offered by such systems, the range of supportable applications is largely limited to task-parallel, compute-bound, high-throughput applications. This limitation stems mainly from the heterogeneity and volatility of the underlying resources, which are shared with the desktop users. Our work aims to broaden the range of applications that desktop grids can support; in particular, we develop scheduling heuristics that enable rapid turnaround for short-lived applications. To that end, the contributions of this dissertation are as follows. First, we measure and characterize four real enterprise desktop grid systems; such characterization is essential for accurate modelling and simulation. Second, using the characterization, we design scheduling heuristics that enable rapid application turnaround. These heuristics are based on three scheduling techniques, namely resource prioritization, resource exclusion, and task replication. We find that our best heuristic uses relatively static resource information for prioritization and exclusion, together with reactive task replication, to achieve performance within a factor of 1.7 of optimal. Third, we implement our best heuristic in a real desktop grid system to demonstrate its feasibility.
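To make the three techniques concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes that the static resource information is a host's clock rate, that exclusion is a fixed clock-rate threshold, and that reactive replication is triggered by a simple running-time timeout. The names `Host`, `clock_ghz`, `select_hosts`, `tasks_to_replicate`, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    clock_ghz: float  # assumed form of "static resource information"


def select_hosts(hosts, min_clock_ghz):
    """Exclusion, then prioritization.

    Hosts slower than the (hypothetical) threshold are excluded;
    the remaining hosts are ranked fastest first.
    """
    eligible = [h for h in hosts if h.clock_ghz >= min_clock_ghz]
    return sorted(eligible, key=lambda h: h.clock_ghz, reverse=True)


def tasks_to_replicate(running, now, timeout):
    """Reactive replication.

    `running` maps task id -> start time. A task is flagged for a
    replica only after it has run longer than `timeout`, i.e. in
    reaction to an observed delay rather than up front.
    """
    return [tid for tid, start in sorted(running.items())
            if now - start > timeout]


# Illustrative data only; none of these values come from the source.
hosts = [Host("a", 1.0), Host("b", 2.4), Host("c", 0.4), Host("d", 3.0)]
ranked = select_hosts(hosts, min_clock_ghz=0.8)
late = tasks_to_replicate({"t1": 0.0, "t2": 50.0}, now=100.0, timeout=60.0)
```

In this sketch, tasks would be handed to hosts in the order given by `ranked`, and each task id in `late` would receive one additional copy on the next free host. How the threshold and timeout should be chosen is not specified in the text above and is left open here.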
