Integration and Evaluation of Decentralized Fairshare Prioritization (Aequus)

Fairshare is commonly one of the factors used by cluster resource management systems to prioritize jobs during scheduling. Despite the grid vision of a transparent and unified infrastructure, fairshare is normally calculated and enforced at the local cluster level rather than at a grid-wide scale. Aequus is a self-contained decentralized system for grid-wide fairshare job prioritization. Using Aequus, detailed global share policies can be combined with local cluster policies to offer a unified grid fairshare prioritization system where local administrations retain control over their clusters. This work shows how Aequus can be integrated with local resource management systems such as SLURM and Maui with minimal intrusion. Early results from production help assess the maturity of the system, and the system is further tested and evaluated for use at a nation-wide scale using workload modeling techniques. Statistical models are created based on historical national grid usage data, and synthetic traces based on these models are used to create a diverse input set used to exemplify system behavior. The system is shown to behave consistently despite great variations in job arrival patterns and partial participation of some of the collaborating installations.
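To make the notion of fairshare prioritization concrete, the following minimal sketch ranks users by the difference between their policy target share and their share of historical usage summed over all participating clusters. It is an illustration only and not Aequus's actual algorithm: the `User` class, the `fairshare_priorities` function, the field names, the core-hour unit, and the simple "target minus usage" formula are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    target_share: float  # fraction of grid resources granted by policy (assumed)
    # Hypothetical per-cluster usage in core-hours, keyed by cluster name.
    usage_by_site: dict = field(default_factory=dict)


def fairshare_priorities(users):
    """Return {user name: target share - grid-wide usage share}.

    Positive values mean the user has consumed less than the policy
    grants and should be prioritized; negative values mean the opposite.
    """
    # Grid-wide usage per user: sum over every cluster that reported usage.
    totals = {u.name: sum(u.usage_by_site.values()) for u in users}
    grand_total = sum(totals.values())
    priorities = {}
    for u in users:
        usage_share = totals[u.name] / grand_total if grand_total > 0 else 0.0
        priorities[u.name] = u.target_share - usage_share
    return priorities


users = [
    User("alice", 0.5, {"cluster_a": 60.0, "cluster_b": 40.0}),
    User("bob", 0.3, {"cluster_a": 700.0}),
    User("carol", 0.2, {"cluster_b": 200.0}),
]
prio = fairshare_priorities(users)
# Users furthest below their target share come first.
ranked = sorted(prio, key=prio.get, reverse=True)
```

In a real deployment a local resource manager would typically weigh such a fairshare value against other factors, such as queue wait time or job size, which is where local cluster policy retains control.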
