Evaluation of strategies to reduce the impact of machine reclaim in cycle-stealing environments

We investigate the scheduling problem that arises when parallel applications execute on a network of machines in a cycle-stealing mode. In this mode of execution, a parallel application runs its tasks on several machines whenever they are idle. When a user reclaims a machine, the tasks running on it must relinquish control immediately. The parallel application therefore risks losing the work in progress on reclaimed machines, and its total execution time is affected by the need to reschedule the pre-empted tasks. We first evaluate the performance of an application when it runs in two different scenarios: a set of N dedicated machines, and a set of N non-dedicated machines (in which pre-emption may occur). This study shows that losing machines may have a considerable impact on the execution time of the application; we therefore propose and evaluate three simple strategies to alleviate this problem. All strategies are based on the use of additional machines, but they differ in the way these extra machines are used. In the first strategy, the additional machines are added to the common pool of machines used by the application. The other two are based on task replication, in which the additional machines are used to execute certain tasks that are already running on other machines.
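The comparison described above can be illustrated with a toy discrete-time simulation. This is a minimal sketch under stated assumptions, not the authors' experimental setup: the function `simulate`, its parameters, the per-step reclaim probability, the fixed `down_time`, and the particular replication rule (copy the single-copy running task with the most remaining work) are all hypothetical choices made for illustration. Adding machines to the common pool corresponds to raising `n_machines`; a generic form of task replication is enabled with `replicate=True`.

```python
import random


def simulate(task_lengths, n_machines, reclaim_prob, replicate=False,
             down_time=5, seed=0):
    """Return the makespan, in time steps, of a bag of independent tasks.

    Assumed model (not taken from the paper): at every step each busy machine
    is reclaimed by its owner with probability reclaim_prob. A reclaimed
    machine loses its task's progress and stays unavailable for down_time steps.
    """
    rng = random.Random(seed)
    pending = list(range(len(task_lengths)))  # task ids waiting to start
    done = set()                              # ids of completed tasks
    running = {}                              # machine id -> [task id, remaining steps]
    down = {}                                 # machine id -> steps until released
    t = 0
    while len(done) < len(task_lengths):
        # Hand work to every idle, available machine.
        for m in range(n_machines):
            if m in running or m in down:
                continue
            if pending:
                tid = pending.pop(0)
            elif replicate and running:
                # Replicate the single-copy task with the most remaining work.
                copies = {}
                for task, _ in running.values():
                    copies[task] = copies.get(task, 0) + 1
                singles = [(rem, task) for task, rem in running.values()
                           if copies[task] == 1]
                if not singles:
                    continue
                tid = max(singles)[1]
            else:
                continue
            running[m] = [tid, task_lengths[tid]]  # a copy starts from scratch
        t += 1
        # Owners release machines whose down time has elapsed.
        for m in list(down):
            down[m] -= 1
            if down[m] <= 0:
                del down[m]
        # Advance or pre-empt each running task.
        for m in list(running):
            if m not in running:  # cancelled: another copy finished this step
                continue
            tid, rem = running[m]
            if rng.random() < reclaim_prob:
                del running[m]
                down[m] = down_time
                # Re-queue only if no other copy of the task is still running.
                if all(task != tid for task, _ in running.values()):
                    pending.append(tid)
                continue
            running[m][1] = rem - 1
            if running[m][1] == 0:
                done.add(tid)
                for other in [k for k, v in running.items() if v[0] == tid]:
                    del running[other]
    return t


# Dedicated baseline: 8 tasks of 10 steps on 4 machines take two rounds.
dedicated = simulate([10] * 8, n_machines=4, reclaim_prob=0.0)  # 20 steps
```

Calling `simulate` with a non-zero `reclaim_prob`, and then with a larger `n_machines` or with `replicate=True`, gives a rough feel for how extra machines can offset reclaims. The numbers depend entirely on the assumed reclaim model and say nothing about the results reported in the paper.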
