Automatic node selection for high performance applications on networks

A central problem in executing performance-critical parallel and distributed applications on shared networks is selecting the computation nodes and communication paths to use. Automatic node selection is complex because the best choice depends on the application structure as well as the expected availability of computation and communication resources. This paper presents a solution to this problem for realistic application and network scenarios. It introduces a new algorithm that jointly analyzes computation and communication resources for different application demands, and develops a framework for automatic node selection on top of Remos, a query interface to network information. The paper reports results from a set of applications, including Airshed pollution modeling and magnetic resonance imaging, executing on a high-speed network testbed. The results demonstrate that node selection is effective in enhancing application performance in the presence of computation load as well as network traffic. Under the network conditions used for the experiments, node selection halved the increase in execution time caused by compute loads and network congestion. The node selection algorithms developed in this research are also applicable to dynamic migration of long-running jobs.
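To make the idea of jointly weighing computation and communication resources concrete, the following is a minimal sketch, not the algorithm from the paper. It assumes that each node reports a fraction of available CPU, that each pair of nodes has a known available bandwidth, and that a group of nodes is only as good as its weakest resource. The function name `select_nodes`, its parameters, and the bottleneck score are hypothetical choices for illustration. The sketch does not use the Remos interface.

```python
from itertools import combinations


def select_nodes(cpu, bandwidth, k, bw_ref):
    """Pick k nodes that maximise a simple bottleneck score (illustrative only).

    cpu:       dict mapping node name -> fraction of CPU available (0.0 to 1.0)
    bandwidth: dict mapping frozenset({a, b}) -> available bandwidth (e.g. Mbps)
    k:         number of nodes the application needs
    bw_ref:    reference bandwidth used to normalise link capacity

    All names and the scoring rule are assumptions, not the paper's method.
    """

    def score(group):
        # Weakest CPU in the group limits computation.
        compute = min(cpu[n] for n in group)
        # Weakest pairwise link limits communication (1.0 if there are no pairs).
        comm = min(
            (bandwidth[frozenset(p)] / bw_ref for p in combinations(group, 2)),
            default=1.0,
        )
        # The group is only as good as its scarcest resource.
        return min(compute, comm)

    # Exhaustive search: acceptable for small candidate sets only.
    return max(combinations(sorted(cpu), k), key=score)


if __name__ == "__main__":
    cpu = {"a": 1.0, "b": 0.5, "c": 0.9, "d": 0.95}
    bandwidth = {
        frozenset(("a", "b")): 100, frozenset(("a", "c")): 20,
        frozenset(("a", "d")): 95, frozenset(("b", "c")): 100,
        frozenset(("b", "d")): 100, frozenset(("c", "d")): 100,
    }
    print(select_nodes(cpu, bandwidth, k=2, bw_ref=100))
```

In this toy input, node `b` is heavily loaded and the link between `a` and `c` is congested, so the search avoids both. Exhaustive search grows combinatorially with the number of candidates; a practical selector would need a cheaper strategy.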
