Performance Modeling Based Scheduling and Rescheduling of Parallel Applications on Computational Grids

Abstract

As computational grids have become popular and ubiquitous, users have access to a large number and different types of geographically distributed grid resources. Many computational grid frameworks are composed of multiple distributed sites, with each site consisting of one or more dedicated or non-dedicated clusters. Jobs submitted to a grid are handled by a metascheduler, which interacts with the local schedulers of the clusters to schedule jobs to the individual clusters. Computational grids have been found to be powerful research-beds for the execution of various kinds of parallel applications. When a parallel application is submitted to a grid, the metascheduler has to choose a set of resources from a cluster for application execution. To select the best set of resources for application execution, it is important to determine the performance of the application. Accurate performance estimates of an application are essential in assisting a grid metascheduler to efficiently schedule user jobs.

Thus, models that predict execution times of parallel applications on a set of resources, and a search procedure (scheduling strategy) that selects the best set of machines within a cluster for application execution, are important for enabling parallel applications on grids. For efficient execution of large scientific parallel applications consisting of multiple phases, performance models of the individual phases should be obtained. Efficient rescheduling strategies that can use the per-phase models to adapt the parallel applications to application and resource dynamics are necessary for maintaining high performance of the applications on grids.

A practical and robust grid computing infrastructure that integrates components related to application and resource monitoring, performance modeling, and scheduling and rescheduling techniques is essential for large-scale deployment and high performance of scientific applications on grid systems, and hence for fostering high performance computing.
